Introduction



The aim of this work is to investigate a system able to detect facial expressions and
to use them in a model for automatic affect recognition, with a view to further
investigating models of social interaction mediated by social signals.

Human-computer interaction has changed greatly over the last decades. Thanks to
new methods and technologies, users can now interact with systems simply through
gestures and motion.

Many applications in disparate fields are moving in this direction, in particular in
the videogame field. Several videogames offer an "intelligent" interaction with a
virtual avatar through gestures or special controllers; however, these interactions are
not truly social, since the avatar cannot understand the emotional state of the user
and is unable to establish believable social interactions.

Humans are experts in social interactions. Therefore, technologies able to adhere
to people's social expectations, for example by making use of the affective state of
the user, are crucial to further improve the user experience of our everyday
appliances [4]. To this end, Affective Computing offers several techniques to extract
the emotional state of the user from visual, audio and biological signals, with varying
degrees of precision and success [5, 10]; nevertheless, in recent years several
emotion-aware applications have emerged from laboratories around the world.

The many theories of emotion formulated over the years provide a useful guideline
for the development of an affect recognition model. In this work, Russell's theory of
core affect [41] plays a crucial role, since it allows emotions to be described by a set
of latent variables, which define a core affect space whose points describe all the
possible emotional states of the subject. Conversely, other theories describe emotions
using a limited number of classes or labels. These labels are too broad in meaning,
since they are created at a later stage by human cognition merely to help us classify
things. As with colours, which humans sort into classes (red, blue, yellow, green, ...)
although the visible hue actually depends on the wavelength of the light, emotions
can be sorted into classes (sadness, happiness, anger, ...) although their exact
affective value depends on a small set of latent affective variables.

Current research on affect recognition [29] holds that an observation carries an
affective power that can be described in terms of affective variables. This in turn
makes it possible to regress a space on the basis of the chosen set of affective
variables. In our view the process runs in the opposite direction: it is the latent space
that possesses an affective power and generates an observation from a specific point
of the latent space, namely the point that maximizes the likelihood of the observation
under examination. The generated observation then allows the subject to understand
the affective state of the partner by comparison with the affective character
suggested by the point of the latent space from which the observation is generated.
This is, for example, one of the views endorsed by mirror neuron theorists.

In this way it is possible to investigate the topology of the latent space without
considering any kind of label on the elements of the database. This in turn implies
that the topology of the latent space can be derived from the raw observations alone
through unsupervised machine learning techniques, for instance dimensionality
reduction techniques.

For this reason, this work proposes a non-linear dimensionality reduction model
based on Gaussian Processes, namely the Gaussian Process Latent Variable
Model [57].

In order to capture from video streams the facial expressions needed to train the
model, an architecture for face detection and normalization is proposed. This
architecture involves a face detector [66] and a tracker based on the Kalman filter [69].
Furthermore, a facial landmark detector [76] is used to estimate the rotation angles of
the face and, consequently, to normalize its pose. Finally, an illumination
normalization procedure is used to remove the noise due to non-uniform lighting
conditions [79].
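
To illustrate the role of the tracker in such an architecture, the following minimal
sketch shows a constant-velocity Kalman filter that smooths one coordinate of the
detected face centre and predicts it through frames in which the detector returns
nothing. It is an illustrative assumption, not the implementation described in
Chapter 4: the class name ConstantVelocityKalman1D, the helper track, the noise
parameters q and r and the simplified process-noise model are all hypothetical.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over the state (position, velocity) of one coordinate."""

    def __init__(self, q=1e-3, r=1.0, initial_var=100.0):
        self.p, self.v = 0.0, 0.0  # state estimate: position and velocity
        # State covariance [[a, b], [c, d]], stored as four scalars.
        self.a, self.b, self.c, self.d = initial_var, 0.0, 0.0, initial_var
        self.q, self.r = q, r      # assumed process / measurement noise variances

    def predict(self, dt=1.0):
        """Propagate one frame ahead: x <- F x, P <- F P F^T + Q, F = [[1, dt], [0, 1]]."""
        self.p += dt * self.v
        a, b, c, d = self.a, self.b, self.c, self.d
        # Simplified process noise: q is added to both diagonal entries.
        self.a = a + dt * (b + c) + dt * dt * d + self.q
        self.b = b + dt * d
        self.c = c + dt * d
        self.d = d + self.q
        return self.p

    def update(self, z):
        """Correct the prediction with a measured position z (H = [1, 0])."""
        y = z - self.p                   # innovation
        s = self.a + self.r              # innovation variance
        k0, k1 = self.a / s, self.c / s  # Kalman gain
        self.p += k0 * y
        self.v += k1 * y
        a, b, c, d = self.a, self.b, self.c, self.d
        # P <- (I - K H) P
        self.a, self.b = (1 - k0) * a, (1 - k0) * b
        self.c, self.d = c - k1 * a, d - k1 * b


def track(measurements, dt=1.0):
    """Filter a sequence of positions; None marks a frame with no detection."""
    kf = ConstantVelocityKalman1D()
    estimates = []
    for z in measurements:
        kf.predict(dt)
        if z is not None:
            kf.update(z)
        estimates.append(kf.p)
    return estimates, kf
```

In a real tracker one such filter (or a single filter with a larger state) would be
run for each coordinate of the face bounding box, and the prediction would be used
to bridge short detection failures.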

The thesis includes some preliminary experiments to test the classification accuracy
of the proposed model. The tests use two kinds of facial expression datasets: a
dataset of videos collected from the Web, with no labels available, and a dataset
created in a laboratory setting, presenting several combinations of the Action Units
used to objectively describe the current facial configuration [81].

For the first dataset only a qualitative evaluation is given, since without labels it
is not easy to produce objective numerical results. However, this first test helps to
understand the behaviour of the proposed model and its main advantages and
defects.

For the second dataset an objective numerical evaluation is used to better interpret
the results. In this case too it is difficult to produce sound results, because several
issues prevent an accurate temporal alignment of the data used for the final
comparison. Nevertheless, these results provide a first guideline for future
improvements.

The thesis is organized as follows: 

Chapter 1 describes the general domain in which our work is placed and develops 
the general idea that leads us to investigate the issues presented above; 

Chapter 2 provides a detailed review of emotions and of the theories of emotion
developed over the past years, in order to give an accurate picture of this domain;

Chapter 3 discusses the model chosen for our purpose of emotion recognition and
introduces its theoretical foundations;

Chapter 4 explains how it is possible to extract faces from videos and describes 
our architecture built to fulfil this task; 

Chapter 5 illustrates the tests made with our model and the related results;

Chapter 6 presents our conclusions and possible future improvements.



Chapter 1 
General overview 



A new idea is delicate. It can be killed by a sneer or a yawn; 
it can be stabbed to death by a quip and worried to death by 
a frown on the right man's brow. 

Charles H. Brower



1.1 A social way of interaction 

Each instant of our life is a constant interaction with what we call reality. We can
see because the cells inside our eyes are stimulated by light; we can touch thanks
to the many nerve endings under our skin; we can hear, taste, smell, feel pain ...
all through our sense organs. The combination of these sensory experiences produces
the reality that surrounds us.

For this reason we are able to interact with our appliances and use them in
everyday life. However, in the world of computers and digital environments there is
a need for interfaces that allow communication and interaction between the real
world and the digital one. Several kinds of interfaces are available, and each of them
gives rise to a model of interaction. What kind of interface, and consequently what
model of interaction, best fits the user's needs?

Answering this question is no easy task: several disciplines, such as
Human-Computer Interaction, Psychology and Design, have been working together
on it for decades without reaching a single answer. However, we can start from one
fact. Humans are social animals, and for this reason they spend most of their time
interacting with one another. For us it is easy to communicate and interact with
someone using natural language, prosody, facial expressions, gestures and so on,
because we grew up with these forms of natural interaction and relied on them while
learning.

Accordingly, these modes of interaction have the advantage of being familiar and
comfortable and of enhancing affordance, so they are preferable in most situations.
Moreover, unlike non-natural user interfaces, natural user interfaces make it possible
to extract important cues that applications can use to increase their value to the user
and to meet more of the consumers' needs.

At this point it is necessary to introduce and define the term anthropomorphism.
In general this word refers to the tendency to attribute humanlike characteristics,
intentions, and behaviour to nonhuman objects [1]. Psychological studies suggest
that increasing accessibility to the human schema 1 results in anthropomorphism [2];
this means, for example, that an object (real or digital) imitating human behaviours
or appearance increases the accessibility of the human schema and consequently
increases anthropomorphism. Obviously it is not necessary to anthropomorphize an
object in order to interact with it naturally; a set of natural user interfaces is enough
for that purpose. However, if we want to create a stronger illusion of social
interaction between the user and the real or digital thing, it is crucial to give the
artefact an anthropomorphic appearance.

This practice is necessary in order to create a metaphor that allows people to
use interfaces by taking advantage of previously learned cognitive models to fulfil
the tasks related to the metaphor itself. If an object is endowed with
anthropomorphic characteristics, people are more likely to interact with it using
modes of interaction analogous to those used for human-human interaction in a social
domain (Fig. 1.1).

The use of metaphors is a fundamental part of our reasoning. A metaphor is defined
as a set of correspondences that map a source domain onto a target domain. These
correspondences allow us to reason in the target domain using the knowledge we
have of the source domain. For example, we use our knowledge of traditional mail to
create a useful mapping to the domain of electronic mail [3].

Indeed, humans are experts in social interactions. Therefore, if technology adheres
to the social expectations of the users, users will find the interaction enjoyable and
they will experience stronger feelings, more congruent with their expectations. It has
been shown [4] that people prefer to interact with machines in the same manner in
which they interact with other people. Thus, it is useful to study the implementation
of a metaphor that maps the correspondences from the domain of the machine to
that of a common social interaction. This is the purpose of this work, as will be
explained in later sections.




Fig. 1.1: Asimo, an example of an anthropomorphic robot interacting with people



1 Set of characteristics, models and behaviours strictly related to humans 






1.2 What is Affective Computing? 

Affective Computing is a fairly new research area at the intersection of Psychology
and Computer Science. It originated with Rosalind W. Picard [5] at MIT, who
framed it as follows:

Computing that relates to, arises from or deliberately influences emotions.

As the definition shows, the role of emotions in this field is crucial. However, why
should a research theme typical of Psychology be important for an area of Computer
Science?

There are in fact several psychological themes that have been investigated by
computer scientists, but it is remarkable that emotions, which mainstream Western
culture deems to be ruled by irrational processes, could be of interest in a field
traditionally governed by logic, determinism and rationality. To better understand
this unusual combination, consider the following investigations into emotions and
their relation to perception and cognition.

A study by Cytowic [6] investigated the synesthetic experience of individuals. A
synesthetic experience consists in associating a sensation felt through one sense
organ with a sensation typical of a different sense organ, for instance "seeing"
colours while hearing music. Cytowic investigated the behaviour of the cortex during
a synesthetic episode. The result was an overall increase of brain metabolism in the
limbic system, and not in the higher cortex, where it was expected. The limbic
system has traditionally been regarded as the set of brain regions supporting
emotion, memory and attention (although the very concept of limbic system has
recently been questioned; cf. LeDoux [7] for a deeper discussion). Its activity during
synesthesia suggests that the limbic system has a crucial role in perception. Indeed,
it is common to perceive the world around us on the basis of our emotional state:
for instance, reality may be seen through rose-coloured glasses during a joyful state.

Findings from recent studies have provided evidence that no clear-cut line can be
drawn between emotion and perception (and more generally cognition, see Pessoa [8]
for an in-depth discussion). It is worth recalling the important study by Damasio [9].
Damasio's patients have injuries in the part of the cortex (orbitofrontal cortex,
OFC) that communicates with the amygdala (a region of the limbic system). These
lesions impair the ability to regulate the interactions between emotional responses
and cortical decision-making structures. Thus, Damasio's patients appear to be
intelligent and very rational, but they are actually unable to make decisions, or they
take far longer than healthy subjects, even when the decision is simple. Damasio
supports the hypothesis that emotions regulate decision-making as a necessary bias
(the Somatic Marker Hypothesis, SMH) to evaluate potential outcomes and to
prevent an infinite logical search.

The implications of such findings are significant also for computer science and
industry: computers, if they are to be truly effective at decision-making, will have to
be endowed with emotion-like mechanisms working in concert with their rule-based
systems [5].

1.2.1 Domains and applications

Like most areas of computer science, Affective Computing can be used in a wide
range of domains and applications, relying on sensing and recognizing user
emotion [10], or on generating expressive affective behaviours in synthetic agents and
robots [11].

One remarkable example is the entertainment domain, and a possible application 
concerns videogames (Affective Gaming). This new form of videogames exploits the 
affective state of the user to calibrate and to manage the game difficulty and/or 
the gameplay. Current focus in Affective Gaming is primarily on the sensing and 
recognition of the players' emotions, and on tailoring the game responses to these 
emotions; e.g., minimizing frustration, ensuring appropriate challenge [12]. 

There are several reasons to use emotions as a form of input in videogames. For
instance, the game could adapt its story and events to the affective state of the
player. An example of this feature could be a loud noise in a horror game when the
player is extremely tense, to heighten the fear of the user and provide a better game
experience [13].

More generally, Affective Computing could be crucial for the future of
human-computer interaction. As discussed in Section 1.1, humans have a bias to treat
things as people. By creating new forms of interaction based on emotions it is
possible to enhance the overall quality of the user experience. An example is provided
by embodied conversational agents (ECA), namely interfaces governed by virtual or
robotic agents that express emotions and recognize the affective state of the user,
using this cue to help the user during the interaction with the application (see
Isbister [14] for more information). An early example of ECA is the Microsoft Office
assistant Clippy, although it was not able to use or express emotions.

Another domain of interest is represented by "technologies as means of
persuasion", which aim at changing the behaviour, feelings and attitudes of users [15].
The health domain is among the most interesting fields where Affective Computing
can be effectively applied to improve users' safety and health. Robots or virtual
avatars can be used in therapies, such as the treatment of autism [16]. Other
applications concern elderly people, especially those who live alone [17].

No less important are applications concerning decision-making and security. As
previously mentioned, emotions could be crucial to improve the quality of decisions
in an intelligent artificial system. Further, emotion recognition could be central to
security applications, for example to detect acts of violence, or to tutoring systems
that prevent the user from losing situation awareness during critical episodes [18].

Many other domains and applications can be devised in which Affective Computing
might play a role: the only limit is the creativity and imagination of researchers.

1.3 The concept

1.3.1 A natural learning process 

A disapproving glance can put us in a bad mood, while we feel positive when we are
praised. Following Damasio [9], negative feelings prevent the individual from falling
back into disagreeable (mental and physical) situations, whereas positive feelings
associate a given event with a profitable outcome [19]. On this basis, emotions are
able to regulate the learning process and the decision-making of individuals.

There was a time when we were babies and did not know exactly the meaning of
words. Our parents helped us to discover the world around us by pointing at things
and suggesting the appropriate names. They spoke with a particular intonation and
displayed facial expressions so as to suggest the emotional valence of the objects
under scrutiny. This kind of progressive learning was not limited to things, but also
extended to events and behaviours, so that we could learn the right way to act [20].

The same learning process applies to pets too. When a new puppy comes into our
house and, for example, destroys our shoes, we rebuke it severely to discourage the
same behaviour in the future. Infants and puppies are not the only cases of this
form of learning: for instance, when adults fail at work, their boss's warning might
trigger an embarrassment that helps to prevent the same error in the future.

This form of learning requires a kind of social interaction based on emotional
signals, and it is therefore crucial to investigate a model for the representation of
emotions and of their dynamics during social interactions. The next sections
illustrate the overall idea in more detail.

1.3.2 An intelligent emotional avatar 

Drawing inspiration from this common learning process, we propose to create a
virtual avatar able to recognize the affective states of users in order to learn new
actions and behaviours. To this end, a camera (possibly a Kinect 2 camera) and a
microphone will be available to capture frames of the user's face and cues from the
user's speech.

The avatar should be able to express emotions through facial expressions that
derive from the avatar's internal affective state, which is in turn regulated by the
social interactions with the users. This behaviour is necessary as feedback for the
user.

At the beginning the avatar will be endowed with a small set of basic behaviours.
One of these will be the ability to mimic, and possibly learn, the actions of the user;
a learned action becomes a new action that the avatar can select in the future. A
specific behaviour will be selected on the basis of the affective state of the avatar,
and the user can then punish or reward the avatar, enabling it to recognize which
behaviour is best under specific circumstances (Fig. 1.2).



2 www.kinectforwindows.org/

Fig. 1.2: Model of interaction between the avatar and the user. The cycle has four
steps: (1) having a social interaction; (2) extracting affective cues from the social
interaction; (3) changing the internal affective state; (4) selecting an action,
performing it and producing a facial expression consistent with the internal affective
state and the selected action.
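
The interaction cycle of Fig. 1.2 can be sketched in a few lines of code. This is only
an illustration under stated assumptions: the class Avatar, the scalar valence used
as internal affective state, the learning rate and the greedy action selection are
hypothetical choices made for exposition, not the design adopted in this work.

```python
class Avatar:
    """Toy avatar that learns which action the user rewards (hypothetical design)."""

    def __init__(self, actions, learning_rate=0.5):
        self.preferences = {a: 0.0 for a in actions}  # learned value of each action
        self.valence = 0.0                            # internal affective state in [-1, 1]
        self.learning_rate = learning_rate
        self.last_action = None

    def perceive(self, affective_cue):
        """Use the cue extracted from the user (-1 = punish, +1 = reward)."""
        if self.last_action is not None:
            old = self.preferences[self.last_action]
            # Move the value of the last action towards the received cue.
            self.preferences[self.last_action] = old + self.learning_rate * (affective_cue - old)
        # Move the internal affective state towards the cue, clipped to [-1, 1].
        self.valence = max(-1.0, min(1.0, 0.5 * self.valence + 0.5 * affective_cue))

    def act(self):
        """Select the preferred action and a facial expression matching the state."""
        self.last_action = max(self.preferences, key=self.preferences.get)
        expression = "smile" if self.valence >= 0 else "frown"
        return self.last_action, expression

    def learn_action(self, name):
        """Mimicry: add an action observed from the user to the repertoire."""
        self.preferences.setdefault(name, 0.0)
```

For example, an avatar that is punished after its first action switches to another
action and shows a negative expression until it is rewarded again.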

1.3.3 Previous works 

At first sight this project might look like science fiction; however, some important
steps have already been taken in this direction. Important examples of research
projects are Kismet and its more sophisticated successor, Leonardo, from the MIT
Media Lab [21, 22].

Kismet is a social robot able to communicate its internal emotive state by
emulating the basic emotions of humans. It can communicate its emotive state through
facial expressions, body posture, gaze direction and quality of voice [23]. Kismet's
facial expressions are generated over a three-dimensional affect space, where each
point (or cluster of points) of this space governs a specific facial expression, which
is interpolated among neighbouring points.

The three dimensions correspond to arousal (high/low), valence (good/bad), and
stance (advance/withdraw) [23] and are inspired by Russell's theory of core affect
(see Section 2.2.6). As core affect theory claims, the current affective state of the
robot is represented by a single point of this space. As this point moves along the
three axes of the space, the facial expression changes according to the trajectory of
the point representing the internal affective state.
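
The idea of a facial expression governed by a point of the affect space and
interpolated from neighbouring points can be illustrated as follows. The sketch is an
assumption made for exposition only and does not reproduce Kismet's actual
implementation: the prototype coordinates, the two expression parameters (brow,
mouth) and the inverse-distance weighting are all hypothetical.

```python
import math

# Hypothetical prototypes: (arousal, valence, stance) -> expression parameters.
PROTOTYPES = {
    (0.8, 0.8, 0.5): {"brow": 0.6, "mouth": 1.0},       # e.g. joy
    (0.8, -0.8, -0.5): {"brow": -1.0, "mouth": -0.6},   # e.g. fear
    (-0.8, -0.6, 0.0): {"brow": -0.4, "mouth": -1.0},   # e.g. sadness
    (0.0, 0.0, 0.0): {"brow": 0.0, "mouth": 0.0},       # neutral
}


def expression_at(point, prototypes=PROTOTYPES, eps=1e-9):
    """Blend the prototype expressions with inverse-distance weights."""
    weights = {}
    for proto in prototypes:
        dist = math.dist(point, proto)
        if dist < eps:  # the point coincides with a prototype: return it unchanged
            return dict(prototypes[proto])
        weights[proto] = 1.0 / dist
    total = sum(weights.values())
    names = next(iter(prototypes.values())).keys()
    return {
        name: sum(w * prototypes[p][name] for p, w in weights.items()) / total
        for name in names
    }
```

Evaluating expression_at along a trajectory in the affect space yields a smoothly
changing set of expression parameters, dominated at each instant by the closest
prototypes.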

The affective state of Kismet can change through interaction. For example, it can
get bored if the interaction is not very exciting, or be surprised if a particularly
interesting object is shown to it. However, Kismet has no real cognition of the
surrounding world, and most of its interactions with the world are low-level processes
based on more or less sophisticated saliency map representations. Unfortunately, the
cognition of things and persons, the theory of mind, the recognition of self and other
broader questions are very complicated issues to cope with; for this reason the main
efforts of Kismet's creators concern human-robot interaction, leaving open, at least
for now, the issues most closely related to artificial intelligence.

Leonardo is the evolution of Kismet: it has more powerful expressive capacities and
it is able to learn through imitation and spatial scaffolding. It has 64 degrees of
freedom and, unlike Kismet, a complete humanoid body. The design is targeted at
rich social exchanges with humans as well as physical interactions with the
environment [24]. It is able to communicate with people through gestures and facial
expressions, and it can manipulate objects.

The robot can locate and identify the facial features of a human partner by
using a camera and facial feature tracking software. Thus, Leonardo is able to
perceive a scene and understand what is currently happening using a tree structure
in which each leaf is specialised in extracting a specific feature, such as faces.

An action system is responsible for the behaviour arbitration of the robot: it drives
the robot through a decision-making process and instructs the motor system on how
to physically carry out the selected action.

The learning process occurs through an imitative interaction inspired by theories
from developmental psychology. There are two phases: in the first, humans imitate
Leonardo's facial expressions; in the second, Leonardo imitates human facial
expressions. This continuous cyclic process leads to an imitative learning of facial
expressions through a form of empathic consciousness.

There are many similarities between the works of Breazeal and the work described
here. Clearly, it is impossible to consider all the aspects and features mentioned
above in a single thesis. For this reason, we concentrate on the first step of the
project, which will be the basis for future work.

Here, we mainly consider aspects concerning the detection and modelling of facial
expressions, as a first step towards investigating and designing models of affective
social interaction.

As is usual in most computer science studies, the main task of a computer scientist
is to turn a complex real problem into a simple and tractable one, possibly with some
simplifications. This task is often accomplished using mathematical models.

The first part of the work investigates machine learning algorithms able to
generate a model that is sufficiently informative for our purpose. Dimensionality
reduction methods will be essential to make the problem tractable in terms of time
and space complexity.

In the second part of the project, the detection of faces and the consequent
extraction of important cues about facial expressions will be crucial. In this effort,
computer vision techniques and image processing algorithms play a central role.



Chapter 2
A world of emotions



Human behaviour flows from three main 
sources: desire, emotion, and knowledge. 



Plato 



2.1 What is an emotion? 

It is not simple to answer the original question posed by James [25]. If someone
asked us to define what an emotion is, we would probably be in serious difficulty
and would describe the concept by examples. Over the past decades, psychology
researchers have tried to understand what an emotion is, but at present we only have
several theories that are often in conflict with each other.

Our brain is very complex and, though the neurosciences have made great strides in
recent years, we do not yet have sufficient knowledge about its behaviour.
Nevertheless, we have some important cues that allow us to exclude some theories
and to make others plausible. So what is an emotion, and why is it so difficult to give
a clear definition?

Emotion is a complex set of interactions among subjective and objective factors,
mediated by neural/hormonal systems, which can (a) give rise to affective
experiences such as feelings of arousal, pleasure/displeasure; (b) generate cognitive
processes such as emotionally relevant perceptual effects, appraisals, labelling
processes; (c) activate widespread physiological adjustments to the arousing
conditions; and (d) lead to behaviour that is often, but not always, expressive,
goal-directed, and adaptive [26].

Anyway, providing a single, simple and clear definition is not an easy task, and 
probably it is not even feasible or correct. 

In the history of Western thought and more specifically in the most recent science 
of affects, different approaches have addressed different aspects of such a complex 
phenomenon, hence giving rise to different definitions. 

These main theories of emotions will be described in the next section.

2.2 Main theories of emotions

Several theories have emerged over the last decades that try to define what an
emotion is and to understand the mental processes that take place during an
affective process.

In these theories, the concept of emotion is expanded to that of "emotive episode",
from which several components can be extracted. Among these there are, for
instance, a cognitive component, a feeling component (emotive experience), a
motivational component (action tendencies), a somatic component (physiological
responses) and a motor component (expressive behaviours of the emotion) [27].

Each theory interprets the individual components of an emotive process differently.
The differences may concern the number of components identified in an emotive
episode, or the definition of emotion itself, identified with one or more components
of the emotive episode.

Other differences among theories concern the representation of emotions, which
can be classified into a discrete number of classes or represented by points in a
multidimensional space, where the discrete classes are nothing more than clusters of
points. Within each of these two strands there are debates concerning, for example,
the number of classes and their labels, or the number of axes of the space and the
labels of the corresponding variables.
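
The relation between the two representations can be made concrete with a small
sketch in which a discrete label is recovered as the nearest cluster centre in a
two-dimensional valence/arousal space. The centre coordinates below are hypothetical
values chosen for illustration, not empirical estimates.

```python
import math

# Hypothetical cluster centres in a (valence, arousal) space, both in [-1, 1].
CENTRES = {
    "happiness": (0.8, 0.5),
    "anger": (-0.7, 0.7),
    "sadness": (-0.7, -0.5),
    "calm": (0.6, -0.6),
}


def label_of(point, centres=CENTRES):
    """Return the discrete label whose centre is closest to a continuous point."""
    return min(centres, key=lambda name: math.dist(point, centres[name]))
```

Two distinct points such as (0.9, 0.4) and (0.5, 0.6) receive the same label, which
shows how a discrete class discards the finer affective value carried by the
coordinates.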

To better understand and analyse a theory of an emotional process, Houwer
suggests a series of questions that a theory should address [27]. The first question
(Q1), named the "elicitation problem", concerns which stimuli of the environment
cause an emotion and which do not. This problem includes two subproblems: the
first (Q1A) asks which stimuli produce an emotion and which do not; the second
(Q1B) asks how the organism performs this task.

The other two questions concern the quantitative and qualitative aspects of
emotions. The quantitative aspect is known as the "intensity problem" (Q2) and
includes two subproblems: the first (Q2A) asks which stimuli cause strong emotions
and which cause weak ones; the second (Q2B) investigates the mechanisms that
determine the intensity of emotions. The qualitative aspect is known as the
"differentiation problem" (Q3) and includes two subproblems: the first (Q3A) asks
which stimuli cause positive emotions and which cause negative ones; the second
(Q3B) investigates the mechanisms that determine the qualitative aspect of an
emotion.

2.2.1 James' theory 

James is known for having reversed, with his theory [25], the order of the events in
an emotive episode. For James it is not the emotional experience that activates facial
muscles and other physiological responses; rather, it is the physiological responses
that make the subject feel an emotive experience ("perception of bodily changes is
the emotion", [25]). This means, for instance, that we do not tremble because we are
afraid, but we are afraid because we tremble, and the emotional experience of fear is
associated with this physiological behaviour.

Both the intensity and the quality of emotions are determined by the intensity and
quality of the physiological responses produced after the stimulus. For this reason
James' theory gives an answer to the intensity and differentiation problems, but not
to the elicitation problem: it does not specify how the physiological responses are
produced after a stimulus.

This theory was highly criticized at the time and more recent investigations and 
experiments have provided controversial evidence. 

However, the idea of emotion as embodiment is a long-lasting trend, and a more
sophisticated, revised modern version, based on recent neurobiological findings, has
been proposed by Damasio [28].

Most importantly from the Affective Computing perspective, under the rationale
that emotional experience is embodied in peripheral physiology, systems can detect
emotions by analysing the pattern of physiological changes associated with each
emotion (assuming that a prototypical physiological response exists for each
emotion). The amount of information that physiological signals can provide is
increasing every day, mainly due to major improvements in the accuracy of
psychophysiology equipment and of the associated data analysis techniques [29].

2.2.2 Affect program theory 

Close to the emotion as embodiment approach, one can find the evolutionary or 
affect program approach [30]. 

The aim of affect program theories is not to explain the process of an emotive 
episode, but to explain how a stimulus is able to cause the effects of a particular 
emotion selected during an emotive episode. 

This theory supports the idea that, during evolution, dedicated neural circuits
were created for each of the six basic emotions. If the activation of a specific
neural circuit passes a threshold, the program of that circuit is activated;
that is, physiological signs, action tendencies and emotional feelings are manifested
in the subject.

Basic emotion theories, inspired by Tomkins' [31] rediscovery of Darwin's [32]
work on the expression of emotion, were developed by Ekman [33] and Izard [34]
(cf. Section 2.4 below).

2.2.3 Schachter's theory and the shift towards cognition 

Schachter [35] tries to address the criticisms and issues of James' original theory. He
states that a stimulus has the ability to produce physiological responses in a subject
and that these responses are then interpreted, in a subsequent step, by a cognitive
process. This in turn identifies the particular emotion and consequently displays it in
the subject ("perceived arousal leads to labelling feelings as an emotion based on
situational cue").

In this way, physiological responses are not the only cause of an emotive
state: a cognitive process is involved too. For instance, a dog that threatens us
and a meeting with our boyfriend can cause the same level of arousal; it is only
at the level of the cognitive process that we attribute danger to the first case
and joy to the second.

Nevertheless, this theory, like James' theory, is not able to answer the
elicitation problem; in fact, after the physiological stimulation phase, it takes as
its main component a cognitive process that is not well defined.

2.2.4 The cognitive approach: Appraisal theory 

Appraisal theories [36, 37, 38] are at the core of a cognitive approach to emotions 
and are usually opposed to the emotion as embodiment approaches. 

This theory, supported by several psychologists over many years, argues, as
Schachter did, that cognition is an antecedent of emotion; however, it regards
this cognitive process not as conscious, but as automatic or unconscious.

This idea was born from the criticisms raised by Zajonc [39] against Schachter's idea:
Zajonc showed that a conscious cognitive process is not necessary to display an
emotion; nevertheless, the data presented by Zajonc do not demonstrate the
nonexistence of an unconscious cognitive process.

Another difference from Schachter's theory is that appraisal theory inserts the
cognitive component immediately after the stimulus and before the physiological
response of the subject. In this way it is the cognitive component that has the
task of establishing which physiological responses to elicit in the subject, on
the basis of the stimulus characteristics, thus answering the elicitation problem
(Q1). Furthermore, it is again the cognitive component that determines the quality
(Q3) and intensity (Q2) of the emotion.

After the emotional experience of the subject, a further cognitive process takes
place; this one is conscious and serves to attribute the correct emotional label.
This means that it is the subject who determines an emotional label after feeling
an emotive episode, whereas it is the unconscious and automatic cognitive process
that determines the physiological behaviour of the subject after a stimulation.

Emotion researchers supporting this theory tried to understand which kinds of
stimuli produce emotions and which do not. Nevertheless it is difficult, if not
impossible, to determine one-to-one relationships between stimuli and emotions,
because they depend on the subject's beliefs and on the context.

Nevertheless, it emerged that these stimuli can be characterised by variables, and
that these variables can be classified in such a way that they help to determine the
intensity and quality of an emotion. Each of these variables considers a particular
aspect of the stimulus. A set of values of these variables creates an appraisal
pattern that, by assumption, is related to one particular emotion. An example of an
appraisal variable is relevance to the goal.

Summing up, these theories assume an emotion architecture that is based on an
individual's subjective evaluation or appraisal of the significance of events for their
wellbeing and goal achievement, postulating a specific set of appraisal criteria (e.g. 
novelty, intrinsic pleasantness, goal conduciveness or motive consistency, agency, 
responsibility, coping, legitimacy and compatibility with self and societal standards). 

2.2.5 The cognitive approach: Network theory 

Bower's Network Theory of Affect [40] is another variant of a cognitive approach to
emotion. It supports the idea that emotions are coded into memory and that the
activation of these memories is the main cause of emotion (Q1). At the beginning only
a few relevant stimuli cause emotions; these stimuli are then progressively processed
through a conditioning procedure.

When a new emotive episode occurs, the information about the stimulus and the
related physiological responses is coded into memory and, through a continuous
matching of the stimulus against those previously coded into memory, the stimulus
takes on the emotional valence of the matching, previously learned stimulus. If the
new stimulus does not match the stimuli of any existing schema, a new schema will be
matched through a generalization process.

This theory answers the elicitation (Q1) and differentiation (Q3) problems by
adding a cognitive component, which uses the schemas coded into memory to match
a new stimulus to the most suitable one. The intensity of the emotion (Q2)
depends on the intensity of the activation of the schema selected by the cognitive
component.

2.2.6 Theory of the core affect of Russell and of conceptual 
act of Barrett 

With respect to previous approaches, Russell's aim [41] is to provide a synthesis 
under the rationale that previous and competing approaches basically addressed 
different. 

In this perspective, the idea that emotions can be reduced to a limited number
of classes is not realistic. Russell supports the idea that these categories are
not given by nature, but are artificial constructions made by society and culture.
Instead, Russell claims the existence of the emotive variables of valence and
arousal, which are the real basic constituents of emotional life: "continuous core
affect constituted by valence and arousal is interpreted and categorized in the light of
situational cues" [41].

It is possible to draw a comparison between emotions and colours. Society and
culture have categorised colours into classes like red, blue, yellow and so on; however,
we know that there are many shades inside a single category and that everything
depends on the wavelength of the light stimulus, which is a continuous variable. In
the same way, emotions are categorised into classes like anger, happiness, joy and so
on, but their real constituents are the two continuous variables of valence and arousal.

These variables are defined as properties of stimuli, of neurophysiological states
and of conscious experiences. The combination of the valence and arousal
values is called "affective quality".

The affective quality of a stimulus causes in the subject a state called "core affect",
with both neurophysiological and mental consequences.

Barrett [42] agrees with Russell's vision; however, she tried to understand how
these points of the affective space can be categorised into classes of emotions. To
fulfil this task, she proposed a theory made of two phases: a first phase in which the
stimulus is mapped into a core affect, and a second phase in which the core affect is
categorised.

However, for Barrett, the categorisation of core affect is not a form of learning
coming from experience, but something that helps to create and cause the emotive
experience in the subject ("core affect is differentiated by a conceptual act that is
driven by embodied representations and available concepts"). The categories are
not static, but depend on the perception of the emotive state on the basis of
previous conceptual knowledge. Furthermore, Barrett maintains that the two phases
are not sequential steps, but two sources that influence each other until they
reach a stable solution.

2.3 Psychophysiology of emotions: foundations 

An emotional response consists of 3 different components: a behavioural component, 
a vegetative component and a hormonal component [43]. 

The behavioural component consists of muscular movements appropriate to the
current stimulus.

The vegetative component facilitates the behavioural one, providing a rapid
mobilization of energy to allow vigorous movements. For instance, an increase in
heart rate and changes in the diameter of blood vessels allow blood to reach the
muscles.

The hormonal component enhances the vegetative responses. Hormones secreted 
by the adrenal medulla (adrenalin and noradrenalin) augment the blood flow through 
the muscles and stimulate the conversion to glucose of the nutritive substances stored 
therein. 

Below we summarize a brief list of neural structures involved in affective
processes [44].

Amygdala - The central nucleus of the amygdala is the most important region
for the expression of emotional responses to harmful stimuli. For this reason it is one
of the most important structures in emotion research, and it has been shown to play
a crucial role in the processing of emotions: assigning an emotional value to
perceived objects and to the environment. After the destruction of the central nucleus,
animals no longer show any sign of fear, even in the presence of stimuli associated
with harmful events. Vice versa, if the amygdala is electrically stimulated, animals
show signs of fear, and prolonged stimulation causes stress-related diseases and
gastric ulcers.

Orbitofrontal cortex - This area is known to be active in the recognition of
emotions displayed by faces. It also has a role in conditioning: its activation
is related to the value of the punishment or reward expected from a certain
action.

Anterior cingulate cortex - The ACC is involved in decision making and
in premotor functions. It is important for the production of emotive responses, for
example arousal, as well as for the regulation of one's own emotive state and the
perception of pain.

Insula - This region is activated during the recognition and production of
disgust; however, it has been found that it also responds to sadness, fear and
rewards. It is also involved in aspects concerning the sensation of pain.

Nucleus accumbens - This area, part of the ventral striatum, is involved in 
conditioning processes and in anticipation of rewards and punishments. 

Thalamus - This structure transmits sensory information to the rest of the
brain.

Ventral tegmental area - This area generates a prediction error signal that is
positive when an unexpected reward is received, and negative when an expected
reward is not received. This signal is carried by the neurotransmitter
dopamine.



2.4 Emotions in social interactions 

Emotions are one of the most effective means of non-verbal communication that
humans have at their disposal to convey simple but high-impact information.
For instance, a scream in itself carries little direct information; nevertheless,
it gives other subjects in the area an indirect warning of an alarming
situation.

To better understand the communicative power of facial expressions, it is possible
to carry out an experiment. Turn on the television on a channel showing a film and
mute the audio. Even without any dialogue, it will be possible to understand what is
happening by observing the facial expressions and gestures of the subjects in the
video.

Several species of animals, including humans, communicate their emotions to
others through postural changes, facial expressions and non-verbal sounds. These
expressions fulfil several social functions; for instance, they communicate to
others what we are feeling and thus what we are probably going to do [43].

Effective communication is a bidirectional process; this means that our
expressiveness is useful only if others are able to pick up our emotional cues and
interpret them. A study by Kraut and Johnston [45] shows that humans
are much more likely to smile when they are engaged in a social interaction with
another person than when they are experiencing a pleasant emotion alone.

These early examples are sufficient to show the fundamental role of emotions
in our social life. Emotion researchers have therefore asked whether facial
expression configurations are innate or learned from the environment. Darwin stated
that emotional expressions are innate and, to support this idea, he collected a series
of positive evidence. He observed the facial expressions of his children and the
expressions of members of isolated cultures around the world. He argued that if people
around the world display the same facial expressions of emotion, then these expressions
must be hereditary rather than learned. A prolonged isolation of different communities
leads to the development of different languages, because there is no biological basis
for language that would justify the use of particular words for particular concepts.
Conversely, facial expressions seem to be the same across different cultures, which
suggests that they are innate.

Ekman and his colleagues confirmed Darwin's idea with further positive
evidence, comparing the facial expressions of blind children with those of sighted
children, and observing the emotive responses of isolated indigenous people [46].

In addition to this innate behaviour of universal facial expressions in response
to a particular stimulus, there also seems to exist an imitative behaviour,
in which mirror neurons play a crucial role.

Mirror neurons are activated when an animal performs a particular task or when it
observes another animal performing the same action. This neural circuit is
activated when we observe another person performing a particular action, and its
feedback could help us to understand what the other person is trying to do. For this
reason it seems that mirror neurons are involved in acquiring the capacity to
imitate other people's behaviour.

Therefore, according to some researchers, this biological component gives us an
internal feedback that helps us to understand what others are feeling when they express
an emotion through their faces, and consequently allows us to behave appropriately
during social interactions [43].

Imitation is probably one of the channels through which organisms communicate
their emotions and regulate social interactions. For instance, if we see someone
looking sad, we will probably tend to display a sad facial expression too. The
sensory feedback helps to put us in the other's shoes and thus makes us more
willing to provide comfort and support. This is probably one of the reasons why we
take pleasure in making people smile: their smile makes us smile and rejoice in turn [43].

Emotions do not only play a role in communication processes; they also seem
to be crucial for the evolution of society [19]. Negative emotions commonly bring
discomfort that is often not only psychological but also physical. Moreover, events
that cause strong emotional responses are more likely to be remembered in the
future [43].

Why do these negative emotions exist if they are so harmful to our mind and
body? There seems to be an analogy with pain and the nervous system: if
we touch a flame with a hand we burn ourselves and feel pain, and this pain allows
us to react instantly and remove the hand from the fire, limiting the damage to
our skin and body. In the same way, emotions work like alarms that regulate
social life and allow the evolution of society [19].

For instance, if a person is reprimanded at work for not having performed their
duties well, they will feel embarrassment and hence psychological and physical
discomfort. This in turn leads the subject to avoid this behaviour in the future, so
as not to feel this bad affective state again. In this way, since the subject will
work better from now on, society gains an opportunity to evolve.

Conversely, positive emotions lead the subject to experience pleasant feelings and
hence to seek them out, thus promoting welfare in society. For example, helping
someone and being reciprocated with a smile and thanks gives us a comfortable
feeling, which encourages us to repeat these behaviours so as to feel these
sensations again, creating a positive evolution for our society.

Summarizing, emotions have two fundamental roles: the first is as external
feedback, which allows us to add important features to the message during
the communication process; the second is as internal feedback, which allows
us to learn which behaviours are advantageous and which are harmful, so that
we can improve our decision-making process and our relationships with others during
social interactions.



Chapter 3 

A model for facial expression 
analysis 



A graphic representation of data abstracted from the banks of 
every computer in the human system. Unthinkable 
complexity. Lines of light ranged in the nonspace of the mind, 
clusters and constellations of data. Like city lights, receding. 

William Gibson 



3.1 Automatic affect recognition 

For humans, affect recognition is a rather simple task. We are able to easily
recognize an emotion through multimodal cues, such as language, vocal
intonation, facial expressions, hand gestures, head movements, body movements
and postures [47]. The same cannot be said for today's machines; in fact,
researchers are still far from good results in this domain, for several reasons.

First of all, we still do not have a unique and sound psychological model of
emotions, and neuroscience studies are not yet able to lead us to a promising
theory; thus, there is still wide room for development, without a clear direction
shared among researchers.

Another issue involves the sensors currently available. Human perception is able
to "produce" images and sounds of the world surrounding us with a level of
quality that is far beyond that of our electronic sensors. Computer vision
researchers often have to face problems such as high dynamic range, image noise
(especially in low illumination conditions), reflections and so on, which cause a
general decay of performance. Furthermore, we still have difficulty developing
techniques for object recognition in images or natural language processing from
audio, which adds further barriers to achieving the goal.

Finally, there is also a problem concerning the emotion datasets currently
available. There are two ways to produce such datasets: using acted facial
expressions or employing subjects engaged in real-life behaviours. In the first case
it is possible to control the scene and the subjects in order to produce videos of
higher quality; however, the expressions of the subjects are not genuine and are often
far from real-life expressions: either too exaggerated or limited to a small and
insufficiently informative set of expressions. In the second case it is possible to
collect many genuine facial expressions, but the quality is not always good and the
data are too heterogeneous for a regression procedure (see Chapter 5). Another problem
of emotion datasets is the intrinsic difficulty of producing objective emotion labels,
since emotion attribution often admits many different shades, so several labels are
possible for the same facial expression.

Early research on emotion recognition devoted most of its effort to the
recognition of the six basic emotions theorized by Darwin and afterwards investigated
by Ekman with cross-cultural studies [48]. The advantage of considering a well-defined
subset of categorical emotions is that these match people's experience well; in
fact, it is easy to assign one of these prototypical expressions to one of the basic
categories of emotion. However, reducing the recognition of emotions to a set of
few categories is not very interesting or informative for applications. Furthermore,
these prototypical facial expressions cover only a small part of our everyday social
life, and become useless for applications that aim at a human-computer
social interaction.

For these reasons, researchers are changing direction by focusing their attention
on the representation of emotion in a continuous space. This view derives from
Russell's core affect theory, which supports the existence of a small set of (latent)
variables able to describe all possible emotions [41]. Researchers are trying to
understand which variables could be the best choice to describe emotions. Examples
of variables currently investigated are valence, arousal, control, power and
dominance. In particular, valence and arousal are used in many studies on emotion
recognition, because they seem to reflect the main aspects of emotion [49]. Valence
quantifies how positive or negative the subject feels during an affective state.
Arousal measures how active or passive the subject's affective state is.

By using these two variables it is possible to represent emotions as points in a
2D space that can be divided into four quadrants: the positive-active quadrant,
including emotions like joy or happiness; the negative-active quadrant, comprising
emotions like anger or fear; the negative-passive quadrant, containing emotions like
depression and tiredness; and finally the positive-passive quadrant, with emotions
like serenity and calm (Fig. 3.1).
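
As a small illustration of the four quadrants, the following sketch maps a point of
the valence-arousal plane to its quadrant. It assumes that both variables are centred
on zero, so that the sign alone decides the quadrant; the function name and the
"neutral" label for points lying on an axis are choices made for this example only.

```python
def affect_quadrant(valence: float, arousal: float) -> str:
    """Return the quadrant of the valence-arousal plane for one point.

    Assumption: both variables are centred on 0, so the sign alone
    decides the quadrant; points on an axis are reported as "neutral".
    """
    if valence == 0 or arousal == 0:
        return "neutral"
    v = "positive" if valence > 0 else "negative"
    a = "active" if arousal > 0 else "passive"
    return f"{v}-{a}"
```

Under these assumptions a point such as (0.8, 0.6) falls in the positive-active
quadrant (joy, happiness), whereas (-0.5, -0.7) falls in the negative-passive one
(depression, tiredness).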

Note that considering such core affect variables as random variables of a
continuous latent space opens up the possibility of applying a variety of methods
that have been developed in statistical machine learning over the last decade [50, 51,
52]. We will further discuss this option in later sections of this work.

On the other hand, the problem of this approach is that, if classification is
specifically addressed, the task is not as intuitive as working in the discrete,
categorical representation of six basic emotions. This issue produces a series of
difficulties in the labelling task, which is necessary in order to obtain emotion
datasets usable as training sets for the regression task.

Fig. 3.1: Example of a 2D latent space of emotions

Most of the works in the literature, whether dealing with classification into six
categories or with regression onto a latent space of emotions, make use of visual
signals for the recognition of the affective state. Psychologists and linguists support
several hypotheses about the importance of different cues in human affect judgment.
Whereas it seems that the relative contributions of facial expression, speech and
body gestures to emotion classification depend both on the current affective state of
the subject and on the surrounding environment, some studies provide evidence in
favour of a major contribution of facial expressions to affect judgment [10].
Furthermore, the integration of multiple modalities, such as vocal cues, facial
expressions and gestures, allows a better classification of emotions by humans [53].

Current research on facial expression recognition can be divided into two
main streams: the recognition of affect and the recognition of facial Action Units
(AUs). Facial AUs are descriptors of the movements of facial muscles [54]. When
a subject produces a facial expression, it can be described as the activation of
a subset of AUs. As AUs are independent of interpretation, it is possible
to use them in a high-level decision-making process in order to recognize emotions.
Ekman proposed a system to analyse AUs and to map sets of AUs to particular affective
states, the Facial Action Coding System (FACS) [54].

There are several methods for mapping facial features to categories of emotions
or to points of a latent space; many of these make use of well-known classifiers and
regressors such as Support Vector Machines (SVM), Relevance Vector Machines (RVM),
neural networks, decision trees and so on. Yet there is no rule for determining the
type of regression to use for emotion recognition: everything depends on the final
purpose and expectations.

3.2 The proposed approach




Fig. 3.2: Graphical model of the generative approach



Inspired by Russell's theory of core affect (see Section 2.2.6), our aim is to discover
a mapping between the visible facial expression of a subject (the activation of several
facial muscles) and the latent variables of a core affect space, so as to describe this
complex behaviour in a simpler way.

We can consider our visual experience as a set of observations over time, each of
which produces a latent variable of core affect¹. An observation at time $t$ is
conditionally dependent on the observation at time $t-1$. The same holds, in principle,
for latent variables: a latent variable at time $t$ is conditionally dependent on the
latent variable at time $t-1$. If we consider this behaviour in a (probabilistic)
generative framework, where the observation is "sampled" from the latent variable
describing the latent state space, we can describe the process through a graphical
model as depicted in Fig. 3.2.

Interestingly enough, exploiting for recognition purposes a generative model of
visible expressions from a hidden core affect space shares some connections with
simulative approaches to social interaction: we are able to infer people's affective
states from their visible expressions because we are ourselves capable of internally
simulating ("as if", i.e. generating) such expressive behaviour and of comparing the
simulated and the actually observed behaviours with one another. This is, for example,
one of the theories endorsed by mirror neuron theorists (cf. Section 2.4).

In this work we are not interested in the affective valence of all possible events
and situations in our world, but only in that of the facial expressions of subjects.
For this reason we can interpret a frame of a video as an image with both a background
content $B_g$ and a foreground content $F_g$. The foreground content contains the
pixels useful for our purpose, namely the pixels of the face of the subject, whereas
the background content includes all the other, unnecessary pixels of the image. So our
observation (the image) is conditioned on a foreground and a background random
variable, as in Fig. 3.3.

If we consider the background as a constant, at most with some noise, it is
possible to drop it from the graphical model. Extending our previously
introduced model (Fig. 3.2), we obtain the general schema proposed in this work
(Fig. 3.4).

Fig. 3.3: Graphical model of an observed image I

¹ Recall that this concept is valid in general for every type of observation: the
perception of an object or of a particular event also has an affective valence, at
most a neutral one.

The model outlined in Fig. 3.4 can be formalized as follows, by considering the
time slice $t, t+1$. Let $i_{t+1}, i_t, s_{t+1}, s_t, y_{t+1}, y_t, x_{t+1}, x_t$ denote
samples from the random variables $I_{t+1}, I_t, S_{t+1}, S_t, Y_{t+1}, Y_t, X_{t+1},
X_t$, respectively.

1. Sample the affective state at time $t+1$, conditioned on the previous affective
state:

\[ x_{t+1} \sim p(x_{t+1} \mid x_t). \tag{3.1} \]

2. Sample the feature-based representation of the visual expression $Y_{t+1}$ on the
basis of the current affective state and the visual expression at time $t$:

\[ y_{t+1} \sim p(y_{t+1} \mid y_t, x_{t+1}). \tag{3.2} \]

3. Sample the state (position and scale) of the face inside the current frame,
namely defining the region of interest (ROI), conditioned on the previous state
of the face:

\[ s_{t+1} \sim p(s_{t+1} \mid s_t). \tag{3.3} \]

4. Sample the observed scene (the frame at time $t+1$) from the current
feature-based representation of the visual expression, the current state of the face
and the previous scene (we omit here for simplicity the constant background
component $B_g$):

\[ i_{t+1} \sim p(i_{t+1} \mid i_t, s_{t+1}, y_{t+1}). \tag{3.4} \]
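
The four sampling steps can be sketched as one pass of ancestral sampling. The
snippet below is only a toy illustration: it assumes scalar variables and Gaussian
conditionals centred on simple averages of their parents, none of which is prescribed
by the model, and the names `sample_step` and `sample_sequence` are hypothetical.

```python
import random


def sample_step(x_t, y_t, s_t, i_t, rng, noise=0.1):
    """One ancestral-sampling pass over Eqs. 3.1-3.4 with toy scalar variables.

    Assumption: every conditional is a Gaussian centred on a simple
    function of its parents; these forms are chosen for illustration only.
    """
    # Eq. 3.1: affective state given the previous affective state.
    x_next = rng.gauss(x_t, noise)
    # Eq. 3.2: expression features given previous features and current affect.
    y_next = rng.gauss(0.5 * y_t + 0.5 * x_next, noise)
    # Eq. 3.3: face state (ROI) given the previous face state.
    s_next = rng.gauss(s_t, noise)
    # Eq. 3.4: observed frame given previous frame, face state and features.
    i_next = rng.gauss((i_t + s_next + y_next) / 3.0, noise)
    return x_next, y_next, s_next, i_next


def sample_sequence(n_steps, seed=0):
    """Generate n_steps tuples (x, y, s, i), starting from an all-zero state."""
    rng = random.Random(seed)
    state = (0.0, 0.0, 0.0, 0.0)
    frames = []
    for _ in range(n_steps):
        state = sample_step(*state, rng)
        frames.append(state)
    return frames
```

The order of the calls mirrors the order of the arrows in the graphical model: each
variable is sampled only after all of its parents are available.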

The above sampling steps fully describe the probabilistic generative model of
facial expressions from affective states. Clearly, the actual aim of this thesis is to
provide a method for inferring the hidden affective state $x_{t+1}$ from the current
feature-based representation of the visual expression $y_{t+1}$ occurring in the
observed scene $i_{t+1}$ at the position defined by $s_{t+1}$.

In general terms such inference should be accomplished by "inverting the arrows",
formally by using Bayes' rule:

\[ p(x_{t+1} \mid i_{t+1}) = \frac{p(x_{t+1}, i_{t+1})}{p(i_{t+1})}, \tag{3.5} \]

where the two terms $p(x_{t+1}, i_{t+1})$ and $p(i_{t+1})$ may be obtained by
marginalizing the joint probability, which factorizes according to Eqs. 3.1-3.4:

\[ p(i_{t+1}, i_t, s_{t+1}, s_t, y_{t+1}, y_t, x_{t+1}, x_t) =
   p(i_{t+1} \mid i_t, s_{t+1}, y_{t+1})\, p(s_{t+1} \mid s_t)\,
   p(y_{t+1} \mid y_t, x_{t+1})\, p(x_{t+1} \mid x_t), \tag{3.6} \]

in such a way that:

\[ p(x_{t+1}, i_{t+1}) = \int p(i_{t+1}, i_t, s_{t+1}, s_t, y_{t+1}, y_t, x_{t+1}, x_t)\,
   di_t\, ds_{t+1}\, ds_t\, dy_{t+1}\, dy_t\, dx_t, \tag{3.7} \]

\[ p(i_{t+1}) = \int p(x_{t+1}, i_{t+1})\, dx_{t+1}. \tag{3.8} \]

Fig. 3.4: Schema of the model proposed

Clearly, the integrals in Eq. 3.7 and Eq. 3.8 cannot be computed in closed form
and some sort of approximation must be found.
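
When all variables are discrete and take only a few values, the integrals reduce to
finite sums and the inversion can be carried out exactly, which makes the logic of
Eqs. 3.5, 3.7 and 3.8 easy to follow. The sketch below assumes a reduced toy chain
with a single latent variable x, a single intermediate variable y and a single
observation i, without temporal links; the function name and the probability tables
are hypothetical.

```python
def posterior_x_given_i(prior_x, p_y_given_x, p_i_given_y, i_obs):
    """p(x | i) for discrete toy variables, by marginalising y and normalising.

    Assumption: the joint factorises as p(x) p(y | x) p(i | y), a reduced
    version of the full model used here only for illustration.
    """
    joint_xi = {}
    for x, px in prior_x.items():
        # Sum out the intermediate variable y (discrete analogue of Eq. 3.7).
        joint_xi[x] = sum(px * pyx * p_i_given_y[y][i_obs]
                          for y, pyx in p_y_given_x[x].items())
    # Sum out x as well to obtain the evidence (discrete analogue of Eq. 3.8).
    evidence = sum(joint_xi.values())
    # Bayes' rule (Eq. 3.5).
    return {x: v / evidence for x, v in joint_xi.items()}


# Hypothetical probability tables for a two-state example.
prior_x = {"passive": 0.5, "active": 0.5}
p_y_given_x = {"passive": {"still": 0.8, "moving": 0.2},
               "active": {"still": 0.3, "moving": 0.7}}
p_i_given_y = {"still": {"o0": 0.6, "o1": 0.4},
               "moving": {"o0": 0.1, "o1": 0.9}}
posterior = posterior_x_given_i(prior_x, p_y_given_x, p_i_given_y, "o1")
```

With these tables the posterior assigns roughly 0.4 to the passive state and 0.6 to
the active one. In the continuous, high-dimensional setting of the model such an
exhaustive enumeration is not available, hence the need for an approximation.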

The complexity of this inference/inversion task is not only formal but also lies in
subtle technical issues. At the bottom of the process (cf. Fig. 3.4) we have a frame
of a video stream. The temporal dependence among consecutive frames is clear,
since we are considering a video stream with frames ordered over time.
Obviously each frame of the video contains many useless parts, so it is necessary to
extract only the region of interest (ROI) containing the face of the subject (the
foreground variable presented in Fig. 3.3), defined by the current state of the face
in the frame ($s_t$). However, Eqs. 3.3 and 3.4 show that the ROI extraction must be
accomplished over time. This means that ROI inference must involve a filtering or
tracking procedure. We will resort to Kalman filtering (see Section 4.3.1) to preserve
the temporal conditioning between consecutive ROIs (Eq. 3.3).
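
To give an idea of how a Kalman filter enforces the temporal conditioning of Eq. 3.3,
the following minimal sketch filters a single scalar component of the face state (for
instance the horizontal position of the ROI). It assumes random-walk dynamics and a
direct noisy measurement; the function name, the noise variances `q` and `r` and the
initial values are illustrative and are not those used in Section 4.3.1.

```python
def kalman_track(measurements, q=1e-2, r=1.0, s0=0.0, p0=1.0):
    """Filter a sequence of noisy scalar measurements of a ROI coordinate.

    Assumption: random-walk dynamics s_{t+1} = s_t + w, w ~ N(0, q),
    and direct observation z_t = s_t + v, v ~ N(0, r).
    """
    s, p = s0, p0
    estimates = []
    for z in measurements:
        # Predict: the mean is unchanged, the uncertainty grows by q.
        p = p + q
        # Update: blend prediction and measurement through the Kalman gain.
        k = p / (p + r)
        s = s + k * (z - s)
        p = (1.0 - k) * p
        estimates.append(s)
    return estimates
```

A small `q` relative to `r` makes the estimate trust the previous state more than the
new detection, which smooths jitter in the ROI at the cost of a slower reaction to
real movements of the face.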

Consider then the feature representation $Y_{t+1}$ of the visible expression; we will
assume in the simplest case that this is a normalized representation of the ROI
(but we have experimented with more complex features, too). Indeed, the face of
the subject is likely to be misaligned and/or to suffer from varying light conditions,
so such normalization is necessary. Through this step we obtain a face picture of
fixed size, with eyes, nose and mouth placed in fixed positions. In this phase we
lose the explicit temporal constraint between consecutive observations because, as
will be shown in the next chapter, whereas we make use of a Kalman filter to
track the positions of facial landmarks (which are used to rotate and resize the face)
over time, we do not have any temporal information for the light normalization task.
Thus $p(Y_{t+1} \mid X_{t+1}) \approx p(Y_{t+1} \mid Y_t, X_{t+1})$. However, observing
that the ROIs are captured in the same video footage with approximately uniform light
conditions, we can assume that these dependences are implicitly accounted for by
lower-level conditioning in the graph. In Fig. 3.4 we stressed this simplifying
assumption by drawing the temporal dependences in lighter gray.

Finally, a dimensionality reduction is performed in order to describe the facial expression
with a few latent variables, and a core affect space is learned on the basis of a training
set of facial expressions. We then expect an emotive episode to be nothing more
than a path along several points of this latent core affect space, so a classification can
be made on the basis of the properties of these paths. In this step, too, temporal
dependences between latent variables are no longer considered
($p(X_{t+1}) = p(X_{t+1} \mid X_t)$); however, with a regression process that keeps points
that are close in the data space close also in the latent space and vice versa, it is
possible to infer an intrinsic temporal dependence between the latent variables of
consecutive frames.

This chapter considers the process of dimensionality reduction, whereas the next
chapter treats the problem of face extraction and normalization, summarized under
the name "face sensing".

3.3 Desirable characteristics of the latent space 

For our aim, the space of latent variables must have two characteristics: 

1. Facial expressions that are similar must lie on close points of the latent space;
2. The transitions between neighbouring points in the latent space produce smooth
transitions in the data space.

Furthermore, the regression process must make use only of the facial expression
images, without the use of labels; namely, our observations are only the current pixels
of the face.

The first two characteristics are necessary in order to set up a model in which the
temporal dynamics of facial expressions can be easily investigated. If similar
facial expressions lie at nearby locations, we can expect the path created by a set of
consecutive frames to form a smooth spline that is easy to model.

Modelling each temporal facial expression with a spline removes particularly
difficult issues related to the duration of an affective state, which is one of
the major challenges in emotion recognition [10]. Furthermore, as our work aims
to investigate affective states from a social point of view, it is possible to check for
the existence of a model able to describe interactions between splines, which could be
used to recognize and predict the affective character of a social interaction occurring
between two subjects.

Finally, using unlabelled information is crucial in order to simplify the data
collection task. By creating a mapping directly between feature points and latent
variables (variables not chosen by other people and so not biased), it is not necessary
to label the audio-visual material, which is often a quite difficult task.

First of all, the labelling task requires deciding which latent variables to use
during the evaluation process. This implies choosing these variables and not others;
unfortunately, as already mentioned, at the moment there is no unique and clear
psychological theory on this issue.

The second problem is that the labelling process, even if it is carried out by
several subjects, is a subjective task, and graduated scales are not sufficiently
precise for producing good quality training data. In our view, as stated previously,
the classification task, namely the labelling process, has to be done in a subsequent
phase by investigating the regressed latent space of emotions (a similar approach
is proposed by Huang et al. [55]).

This last point suggests an unsupervised learning procedure, for which, as stated
before, a dimensionality reduction algorithm can be the right approach. However,
it is important that during the process of dimensionality reduction the variables of
the latent space remain sufficiently informative to reconstruct the observations, and
that the topology of the new latent space produces smooth dynamics among all the
facial expressions. For these reasons linear dimensionality reduction techniques are
not sufficient for our purpose.

At first sight it may seem necessary to extract only a few representative features
from the raster image of the face, in order to reduce the dimension of the observations
and consequently have a better chance of regressing a well-performing mapping from data
space to latent space. However, in this way we would lose important information
about the face texture, such as facial wrinkles. Furthermore, techniques such as Active
Shape Models or Active Appearance Models (see Section 4.4) introduce further
noise on the observations, since facial feature extraction is not an easy task. So,
the use of the pixel values of the face image may be an appropriate choice for the
task of complex facial expression recognition.

In this work we propose to use Gaussian processes, and in particular the Gaussian
Process Latent Variable Model (GPLVM), as the method of dimensionality reduction,
since they possess several interesting properties useful to achieve our aim, as will
become clearer in the next sections.

3.4 Background on Gaussian processes 

Gaussian processes (GPs) are powerful tools for regression, and they also
play an important role in the theory of probability. Gaussian processes provide
a principled, practical, probabilistic approach to learning in kernel machines [50].
They are used in several domains; one application domain is geostatistics,
a branch of statistics that studies phenomena with a spatiotemporal character.
Another application of GPs is to model things evolving over time, for example a facial
expression given a set of video frames, as investigated in this work,
or the 3D pose of a person given a series of 2D silhouettes [56]. Let us now formally
define a Gaussian process:

Definition 1. For any set $\mathcal{X}$, a Gaussian process on $\mathcal{X}$ is a set of random
variables $(f(x) : x \in \mathcal{X})$ s.t. $\forall n \in \mathbb{N}$ and
$\forall x_1, \ldots, x_n \in \mathcal{X}$, $(f(x_1), \ldots, f(x_n))$ has a multivariate
Gaussian distribution.

Whereas a probability distribution describes the distribution of scalars and vectors,
a stochastic process describes a distribution over functions. Simplifying, it is
possible to think of a function $f(x)$ as a very long (infinite) vector where each entry
is the value of the function at a particular $x$. A Gaussian process defines a prior over
functions; in fact it is a generalization of the Gaussian distribution and, as a stochastic
process, it describes a distribution over functions. To clarify how a GP works in
practice, we now introduce a trivial example:

Example 1. Let $\mathcal{X} = \mathbb{R}$ and $W \sim \mathcal{N}(0, 1)$; then $f(x) = xW$ is a Gaussian process.

In this example we have realized a set of random lines (Fig. 3.5). We have
defined a function model $f(x)$ which is modulated by a normal distribution on
$W$, generating a set of normally distributed linear functions.
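
Example 1 can be reproduced numerically. The following is a minimal sketch in Python (standard library only); the function names are illustrative and are not part of this work. For $f(x) = xW$ the mean is zero and the covariance is $\mathbb{E}[xW \cdot x'W] = x x'$, which the Monte Carlo estimate below should approach.

```python
import random


def sample_line(rng):
    """Draw W ~ N(0, 1) once and return the random line f(x) = x * W."""
    w = rng.gauss(0.0, 1.0)
    return lambda x: x * w


def empirical_covariance(x1, x2, n_samples=20000, seed=0):
    """Monte Carlo estimate of E[f(x1) f(x2)]; the mean of f is zero."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        f = sample_line(rng)
        total += f(x1) * f(x2)
    return total / n_samples
```

Calling `empirical_covariance(2.0, 3.0)` should give a value near $2 \cdot 3 = 6$, up to sampling error.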

As a Gaussian distribution over functions, a Gaussian process can be uniquely
described by its mean and covariance functions, respectively $m(x)$ and $k(x, x')$, defined
as:

$$\begin{aligned} m(x) &= \mathbb{E}[f(x)], \\ k(x, x') &= \mathbb{E}[(f(x) - m(x))(f(x') - m(x'))], \end{aligned} \tag{3.9}$$

that is, a real process $f(x)$ can be defined as:

$$f(x) \sim \mathcal{GP}(m(x), k(x, x')) \tag{3.10}$$

if and only if for all $n \geq 1$ and $x_1, \ldots, x_n$, $(f(x_1), \ldots, f(x_n)) \sim \mathcal{N}_n(\mu, K)$,
with $\mu = [m(x_1), \ldots, m(x_n)]$ and $[K]_{ij} = k(x_i, x_j)$.

Fig. 3.5: Example of random lines drawn from GP prior



Usually the mean function is specified as a zero function, whereas the covariance 
function is often a non-linear one. It is clear that the covariance function plays an 
important role in the regression task, since it encodes all the necessary information 
on f(x), such as its local smoothness, continuity, periodicity, etc. 

If we specify a zero function as the mean function of the Gaussian process, the
covariance $k(x, x')$ becomes:

$$k(x, x') = \mathbb{E}[f(x) f(x')], \tag{3.11}$$

that is, the covariance function is completely defined through dot products (always
thinking of $f(x)$ as a long vector with infinite dimension). Here, if there is a kernel
function $K$ such that $K(x_i, x_j) = f(x_i) f(x_j)$, it is possible to use only $K$ in the
training algorithm, without even explicitly knowing what $f$ is. A kernel function must
be continuous and symmetric, and must have a positive semi-definite Gram matrix.

The trivial case is a linear kernel, where $K(x_i, x_j) = x_i^T x_j + c$ with $c$ a constant.
With this kind of kernel as the covariance function of a GP, it is possible to describe a
set of linear functions, such as those in Example 1. However, this model is too
rigid for most regression tasks, since it lets us regress only linear
functions.

A Gaussian process prior (Eq. 3.10) allows for non-linear mappings if the kernel
$k$ is non-linear. There are several kinds of non-linear kernels, with several free
parameters, in order to describe sufficiently flexible models for regression. An example
of a non-linear kernel is the Radial Basis Function (RBF), defined as:

$$k(x_i, x_j) = \sigma_s^2 \exp\left(-\frac{1}{2\ell^2} \lVert x_i - x_j \rVert^2\right) + \sigma_n^2 \delta_{ij}, \tag{3.12}$$

with the length-scale $\ell$, the signal variance $\sigma_s^2$ and the noise variance $\sigma_n^2$ as free
parameters of $k$ ($\delta_{ij}$ is the Kronecker delta). These free parameters control the
properties of the functions generated by the GP and are called hyperparameters.

In a simple regression task it is possible to assume a fixed set of values for the
free parameters. In this way we restrict the regression to only those functions
with the specific properties given by the selected hyperparameter values, and
consequently we define a bias for the regression; this is what happens with other
machine learning methods such as Support Vector Machines or Neural Networks,
where in fact the most difficult problem is to define a good set of parameters for the
model in order to allow the algorithm to learn the objective function.

As will be shown later, with GPs it is also possible to regress the set of
hyperparameter values that best fits the observed data, increasing the
flexibility and consequently the quality of the overall regression. For now, however,
we define a simple regression task over a GP, assuming the absence of noise on
the observations.

Definition 2. Given a set $X_*$ of input points and a kernel function $K(X, X')$ we
define a GP $f_* \sim \mathcal{N}(0, K(X_*, X_*))$ representing our prior. Let $f$ be a set of
observations on a subset $X \subset X_*$. Then, the joint distribution of the observation set $f$
and the prior $f_*$ is:

$$\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix}\right) \tag{3.13}$$

and consequently the regression task is defined as the computation of the posterior:

$$f_* \mid X_*, X, f \sim \mathcal{N}(m_{post}, k_{post}) \tag{3.14}$$

with $m_{post}$ and $k_{post}$ computed as:

$$\begin{aligned} m_{post} &= K(X_*, X) K(X, X)^{-1} f, \\ k_{post} &= K(X_*, X_*) - K(X_*, X) K(X, X)^{-1} K(X, X_*). \end{aligned} \tag{3.15}$$

If our observations are corrupted by noise, it is sufficient to consider our
observations as $y = f(x) + \epsilon$, with $\epsilon$ an additive independent identically
distributed Gaussian noise with variance $\sigma_n^2$. For further details on this less
trivial approach we refer the reader to [50].
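
The posterior of Eq. 3.15 can be sketched for scalar inputs as follows. This is a minimal illustration in Python (standard library only), not the implementation used in this work: the RBF kernel with unit hyperparameters, the small `jitter` added to the diagonal for numerical stability, and all function names are assumptions made for the example. Setting `noise_var` to a positive value gives the noisy case, where $\sigma_n^2$ is added to the diagonal of $K(X, X)$.

```python
import math


def rbf(a, b, length_scale=1.0, signal_var=1.0):
    """RBF covariance between two scalar inputs (noise-free part of Eq. 3.12)."""
    return signal_var * math.exp(-0.5 * (a - b) ** 2 / length_scale ** 2)


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z


def gp_posterior(x_train, f_train, x_test, noise_var=0.0, jitter=1e-10):
    """Posterior mean and variance at one test input, following Eq. 3.15."""
    n = len(x_train)
    K = [[rbf(x_train[i], x_train[j]) + (noise_var + jitter if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    k_star = [rbf(x_test, xi) for xi in x_train]
    mean = sum(ks * a for ks, a in zip(k_star, solve(K, f_train)))
    var = rbf(x_test, x_test) - sum(ks * v for ks, v in zip(k_star, solve(K, k_star)))
    return mean, var
```

In the noise-free case the posterior mean interpolates the observations and the posterior variance vanishes at the training inputs, while far from the data the posterior falls back to the prior (zero mean, unit variance here).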

As stated previously, it is possible to regress not only the parameters $w$ of
the kernel function, but also its hyperparameters $\theta$. This task is usually called model
selection, where the model is usually indicated by $\mathcal{H}_i$.

In this case, too, it is possible to use a Bayesian approach, applying Bayes' rule
in a hierarchical way; namely, inference takes place one level at a time.
At the bottom level we have to compute the posterior over the parameters as:

$$p(w \mid y, X, \theta, \mathcal{H}_i) = \frac{p(y \mid X, w, \mathcal{H}_i)\, p(w \mid \theta, \mathcal{H}_i)}{p(y \mid X, \theta, \mathcal{H}_i)} \tag{3.16}$$

where $p(y \mid X, w, \mathcal{H}_i)$ is the likelihood and $p(w \mid \theta, \mathcal{H}_i)$ the prior, which encodes
as a probability distribution our knowledge about the parameters prior to seeing the data.
The normalizing constant $p(y \mid X, \theta, \mathcal{H}_i)$ is independent of the parameters and is
called the marginal likelihood, defined as:

$$p(y \mid X, \theta, \mathcal{H}_i) = \int p(y \mid X, w, \mathcal{H}_i)\, p(w \mid \theta, \mathcal{H}_i)\, dw \tag{3.17}$$

One level above, we can similarly express the posterior over the hyperparameters as:

$$p(\theta \mid y, X, \mathcal{H}_i) = \frac{p(y \mid X, \theta, \mathcal{H}_i)\, p(\theta \mid \mathcal{H}_i)}{p(y \mid X, \mathcal{H}_i)} \tag{3.18}$$

where the marginal likelihood from the previous level plays the role of the likelihood,
whereas $p(\theta \mid \mathcal{H}_i)$ is the hyper-prior, that is the prior for the hyperparameters,
and $p(y \mid X, \mathcal{H}_i)$ is a normalizing constant defined as:

$$p(y \mid X, \mathcal{H}_i) = \int p(y \mid X, \theta, \mathcal{H}_i)\, p(\theta \mid \mathcal{H}_i)\, d\theta \tag{3.19}$$

Finally, at the top level we are able to compute the posterior for the model as:

$$p(\mathcal{H}_i \mid y, X) = \frac{p(y \mid X, \mathcal{H}_i)\, p(\mathcal{H}_i)}{p(y \mid X)} \tag{3.20}$$

where $p(y \mid X)$ is now simply defined as:

$$p(y \mid X) = \sum_i p(y \mid X, \mathcal{H}_i)\, p(\mathcal{H}_i). \tag{3.21}$$

Gaussian processes place a prior on the space of functions $f$ directly, without
parameterizing $f$. Therefore, Gaussian processes are non-parametric.
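
For a GP with Gaussian noise the marginal likelihood over the function values is available in closed form, $\log p(y \mid X, \theta) = -\frac{1}{2} y^T K^{-1} y - \frac{1}{2} \log |K| - \frac{n}{2} \log 2\pi$, where $K$ includes the noise variance on its diagonal. A common practical shortcut, assumed here in place of the full hierarchical treatment described above, is to choose the hyperparameters that maximize this quantity. The sketch below (Python, standard library only; scalar inputs, RBF kernel, illustrative function names) evaluates it through a Cholesky factorization and selects a length-scale from a list of candidates.

```python
import math


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def log_marginal_likelihood(x, y, length_scale, signal_var, noise_var):
    """log p(y | X, theta) for an RBF kernel with noise on the diagonal."""
    n = len(x)
    K = [[signal_var * math.exp(-0.5 * (x[i] - x[j]) ** 2 / length_scale ** 2)
          + (noise_var if i == j else 0.0) for j in range(n)] for i in range(n)]
    L = cholesky(K)
    # Forward substitution: solve L z = y, so that y^T K^{-1} y = z^T z.
    z = []
    for i in range(n):
        z.append((y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i])
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(n))
    return -0.5 * sum(v * v for v in z) - 0.5 * log_det - 0.5 * n * math.log(2 * math.pi)


def select_length_scale(x, y, candidates, signal_var=1.0, noise_var=0.01):
    """Candidate length-scale with the highest log marginal likelihood."""
    return max(candidates,
               key=lambda l: log_marginal_likelihood(x, y, l, signal_var, noise_var))
```

For constant observations, for instance, a long length-scale explains the data far better than a very short one, and the selection reflects this.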

3.5 Gaussian Process Latent Variables Model 

Latent Variable Models (LVMs) carry out the idea that data which is apparently
high-dimensional may actually lie on a low-dimensional non-linear manifold.
Considering a set of observations $(y_1, \ldots, y_n) \in \mathcal{Y}^D$ and a set of latent (hidden)
variables $(x_1, \ldots, x_n) \in \mathcal{X}^L$ with $L \ll D$, we wish to learn a mapping
$f : \mathcal{X} \to \mathcal{Y}$ such that $\forall i \in \{1, \ldots, n\}$, $y_i = f(x_i, W) + \epsilon$, with $W$ the free
parameters of $f$ and $\epsilon$ an additive noise over the observations.

From what was discussed in the previous section it is clear that GPs are powerful
tools for regression for several reasons, above all the possibility to regress a model
without parameterizing the prior. Furthermore, their ability to work with several
kinds of kernel functions makes it possible to specify the desired properties of the
target function.

It would be interesting to use GPs as a LVM in order to obtain on the one hand 
the dimensionality reduction and on the other hand specific properties over the latent 
space created by the dimensionality reduction process, as well as the possibility to 
solve the problem in a sound probabilistic way. 

A probabilistic interpretation of the problem makes it possible, for example, to handle
incomplete data, to sample new data from the learned probabilistic model, and to extend
the model with prior knowledge or integrate it with other probabilistic models.

Lawrence goes in this direction and proposes a new dimensionality reduction
approach based on Gaussian processes: the Gaussian Process Latent Variable Model
(GP-LVM) [57].

Fig. 3.6: Graphic model of PPCA (a) and GPLVM (b)

Lawrence shows that Principal Component Analysis (PCA) and Probabilistic PCA
(PPCA) are nothing more than particular cases of a GP prior with a linear kernel. Using
a GP prior with a non-linear kernel, it is possible to address the problem of
dimensionality reduction with non-linear mappings.

Whereas PPCA combines a Gaussian likelihood

$$p(Y \mid W, X, \beta) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid W x_n, \beta^{-1} I), \tag{3.22}$$

with a Gaussian prior on the latent variables $X$, marginalising over it:

$$p(y_n \mid W, \beta) = \int p(y_n \mid x_n, W, \beta)\, p(x_n)\, dx_n, \tag{3.23}$$

GPLVM does the opposite (Fig. 3.6), placing the prior on the parameters $W$ of the
mapping function, $p(W) = \prod_{i=1}^{D} \mathcal{N}(w_i \mid 0, \alpha^{-1} I)$, and marginalising over it:

$$p(y_n \mid X, \beta) = \int p(y_n \mid x_n, W, \beta)\, p(W)\, dW \tag{3.24}$$

Consequently, the solution for $X$ can be found by assuming that the $y_n$ are i.i.d. and
maximising the likelihood

$$p(Y \mid X, \beta) = \prod_{n=1}^{N} p(y_n \mid X, \beta) \tag{3.25}$$

which finally gives a marginalised likelihood for $Y$:

$$p(Y \mid X, \beta) = \prod_{n=1}^{N} \int p(y_n \mid x_n, W, \beta)\, p(W)\, dW = \frac{1}{(2\pi)^{\frac{DN}{2}} |K|^{\frac{D}{2}}} \exp\left(-\frac{1}{2} \operatorname{tr}(K^{-1} Y Y^T)\right) \tag{3.26}$$

where $K = \alpha X X^T + \beta^{-1} I$ and $X = [x_1^T \ldots x_N^T]^T$. Maximising Eq. 3.26 is equivalent to
maximising its logarithm

$$L = -\frac{DN}{2} \ln(2\pi) - \frac{D}{2} \ln |K| - \frac{1}{2} \operatorname{tr}(K^{-1} Y Y^T). \tag{3.27}$$

It is possible to optimise the likelihood with respect to $X$ with the gradient

$$\frac{\partial L}{\partial X} = \alpha K^{-1} Y Y^T K^{-1} X - \alpha D K^{-1} X, \tag{3.28}$$

which implies the solution

$$\frac{1}{D} Y Y^T K^{-1} X = X \tag{3.29}$$

and some algebraic manipulation of this formula leads to

$$X = U_q L V^T \tag{3.30}$$

where $U_q$ is an $N \times q$ matrix whose columns are eigenvectors of $Y Y^T$, $L$ is a $q \times q$
diagonal matrix whose $j$th element is $l_j = \left(\frac{\lambda_j}{\alpha D} - \frac{1}{\alpha \beta}\right)^{-\frac{1}{2}}$, with $\lambda_j$ the
eigenvalue associated with the $j$th eigenvector, and $V$ is an arbitrary $q \times q$
orthogonal matrix. This eigenvalue problem can easily be shown to be equivalent to
that solved in PCA; for this reason the PCA inner products $Y Y^T$ can be replaced by a
non-linear kernel in order to extend the PCA model with a non-linear mapping, using the
same approach shown above.
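
The gradient of Eq. 3.28 can be checked numerically against the objective of Eq. 3.27. The sketch below (Python, standard library only) assumes that $L$ denotes the log-likelihood $L = -\frac{DN}{2}\ln(2\pi) - \frac{D}{2}\ln|K| - \frac{1}{2}\operatorname{tr}(K^{-1}YY^T)$ with the linear kernel $K = \alpha XX^T + \beta^{-1}I$; the helper names are illustrative and the matrix routines are only suitable for small examples.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def inverse_and_logdet(K):
    """Gauss-Jordan inverse and log-determinant of a small SPD matrix."""
    n = len(K)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(K)]
    logdet = 0.0
    for c in range(n):
        pivot = A[c][c]  # positive for SPD matrices, so no pivoting is needed
        logdet += math.log(pivot)
        A[c] = [v / pivot for v in A[c]]
        for r in range(n):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [row[n:] for row in A], logdet


def kernel(X, alpha, beta):
    """Linear kernel K = alpha X X^T + beta^{-1} I."""
    XXt = matmul(X, transpose(X))
    n = len(X)
    return [[alpha * XXt[i][j] + (1.0 / beta if i == j else 0.0) for j in range(n)]
            for i in range(n)]


def log_likelihood(X, Y, alpha, beta):
    """Objective of Eq. 3.27 for latent points X (N x q) and data Y (N x D)."""
    N, D = len(Y), len(Y[0])
    Kinv, logdet = inverse_and_logdet(kernel(X, alpha, beta))
    YYt = matmul(Y, transpose(Y))
    trace = sum(Kinv[i][j] * YYt[j][i] for i in range(N) for j in range(N))
    return -0.5 * D * N * math.log(2 * math.pi) - 0.5 * D * logdet - 0.5 * trace


def gradient(X, Y, alpha, beta):
    """Analytic gradient of Eq. 3.28 with respect to X."""
    D = len(Y[0])
    Kinv, _ = inverse_and_logdet(kernel(X, alpha, beta))
    KinvX = matmul(Kinv, X)
    first = matmul(matmul(Kinv, matmul(Y, transpose(Y))), KinvX)
    return [[alpha * a - alpha * D * k for a, k in zip(ra, rk)]
            for ra, rk in zip(first, KinvX)]
```

Comparing `gradient` with central finite differences of `log_likelihood` on a small random problem is a quick way to validate both the formula and an implementation before running a gradient-based optimiser.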

Lawrence suggests an RBF function (Eq. 3.12) as the non-linear kernel, and the
use of scaled conjugate gradients (SCG) [58] for the non-linear optimisation task.

Furthermore, in order to make the algorithm computationally more efficient, a
sparsification process was used, sampling data points with the informative vector machine
(IVM) [59], which subsamples the observations sequentially according to the reduction
in the posterior process entropy that they induce.

GP-LVM gives a smooth mapping from latent to data space; however, whereas
points that are close in latent space will be close in data space, points close in the
data space may not be close in latent space. This is due to the intrinsically
non-linear nature of the mapping, which cannot maintain the correct distances between
points.

Fig. 3.7: Examples of the application of the back-constraint. Panel (a) shows a
latent space regressed without the back-constraint, whereas panel (b)
shows the same latent space regressed with the back-constraint. In the latter case
the dynamics of the points are clearly smoother.

To solve this issue Lawrence and Candela [60] propose a way to force GP-LVM
to fulfil the second property. By back-constraining each $x_i$ to be a smooth
mapping from $y_i$, local distances can be respected. Any smooth function can be used as
the mapping, such as a neural network, an RBF network or a kernel-based
mapping. The constraints are of the form

$$x_i = g(y_i, W)$$

with $W$ the parameters of the smooth mapping function. This constrains points that
are close in the observed space to also be close in the latent space. The mapping
from $Y$ to $X$ is called the back-constraint.

It is then possible to extend GP-LVM with the back-constraint by computing the
gradient $\frac{\partial L}{\partial W}$, with $L$ the GP-LVM objective of Eq. 3.27, via the chain rule, and
optimising the parameters of the back-constraining mapping (Fig. 3.7).
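
The chain rule mentioned above can be written out explicitly. The following is a sketch under the assumption that each latent coordinate is produced by the back-constraint, $x_{ij} = g_j(y_i, W)$, and that $w$ denotes any single parameter in $W$; the index conventions are illustrative:

```latex
\frac{\partial L}{\partial w}
  = \sum_{i=1}^{N} \sum_{j=1}^{q}
    \frac{\partial L}{\partial x_{ij}} \,
    \frac{\partial x_{ij}}{\partial w},
\qquad x_{ij} = g_j(y_i, W).
```

The first factor is the gradient of the GP-LVM objective with respect to the latent points, while the second depends only on the chosen mapping $g$.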



Chapter 4 
Face sensing 



A man's face as a rule says more, and more interesting things, 
than his mouth, for it is a compendium of everything his 
mouth will ever say, in that it is the monogram of all this 
man's thoughts and aspirations. 

Arthur Schopenhauer 



4.1 Introduction to the architecture 

A crucial element of this project is the detection and extraction of normalized faces
from videos, namely face sensing. Without face pictures it is impossible to train
an emotion recognition system, and obviously the performance of the face
extractor is also an important factor in the success of the project.

The architecture of the face extractor includes four main blocks (Fig. 4.1): 

Face detector - The first step for the extraction of facial cues from videos is
to detect a face in them. Fortunately, the current state of the art lets us obtain good
results at low computational cost, especially under good illumination conditions of
the scene.

Tracking system - This element is necessary in order to filter the errors of
the measurement system (face detector) or to reconstruct missing measurements with
predictions. With a tracking system we are able to further improve the performance
of face detection in a video sequence.

Facial landmarks localization - It is necessary to have some reference points
in order to recognize the pose of a face and consequently to extract a fixed and
normalized area containing it. Fortunately the human face has eyes, mouth
and nose, which can easily be used as reference points.

Facial image normalization - Using the information of the facial landmark
detector, it is possible to compute a rigid transformation that aligns the eye line with
the X-axis of the image and scales it in order to obtain a fixed position and size of the
face elements. Furthermore, an algorithm able to normalize the images under different
lighting conditions must be applied. In fact we assume that within a
video shot the lighting conditions are acceptable and more or less stable (see below),
but this assumption does not address the problem of different lighting conditions
among all the videos in the database.

Fig. 4.1: Architecture of the face extractor system
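
The geometric part of the normalization can be sketched as follows. This is a minimal illustration in Python (standard library only), not the C++/OpenCV code of this work: the eye coordinates, the target inter-ocular distance and the function names are hypothetical, and the transform is a rotation plus a uniform scale about a chosen centre.

```python
import math


def eye_alignment(left_eye, right_eye, target_distance):
    """Rotation (radians) and scale that make the eye line horizontal
    with a fixed inter-ocular distance."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    tilt = math.atan2(dy, dx)                 # current angle of the eye line
    scale = target_distance / math.hypot(dx, dy)
    return -tilt, scale                       # rotate back by the tilt


def transform(point, centre, rotation, scale):
    """Rotate and scale a point about the given centre."""
    x, y = point[0] - centre[0], point[1] - centre[1]
    c, s = math.cos(rotation), math.sin(rotation)
    return (centre[0] + scale * (c * x - s * y),
            centre[1] + scale * (s * x + c * y))
```

Applying the resulting transform about the left eye moves the right eye onto the same image row, at exactly the target distance; the same rotation and scale would then be applied to every pixel of the face region.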

In uncontrolled environments there are several variables that are difficult to
manage; for this reason some assumptions were made in order to simplify the problem:

1. Only one subject is present in a video shot. With this assumption it is possible
to avoid the problem of identity measurement, which is necessary if we want
to track several persons at the same time. Furthermore we assume that the
subject is always present in the scene from the beginning of tracking;

2. The subjects do not have beards or occlusions due to eye glasses, hair, hats,
etc. This assumption gives us more precision during the process of pose
estimation and affect recognition;

3. The distance of the face from the camera remains more or less the same
for the entire duration of the video shot. Furthermore, the head
remains in a frontal position most of the time. With this assumption it is
possible to extract only frontal faces of a sufficient scale,
so that good facial cues for emotion recognition can be obtained;

4. The illumination of the scene is sufficient to easily detect a face, which means
that shadows and/or highlights on the face are limited. With this last assumption
the performance of the face detector can be acceptable without the use of
pre-processing methods for image enhancement.



It is clear that these assumptions take us away from real-life problems; however,
addressing these issues now would be premature, and it can be done in future work
when the model for emotion recognition is sufficiently sound.

All the components of the architecture have been programmed in C++ using the
OpenCV¹ libraries. In the following sections each component will be discussed and
analysed.

4.2 The face detector 

As stated in the previous section, face detection is the first step in
extracting faces from video streams. Its performance has a decisive impact on overall
efficacy. An optimal face detector should be able to locate all the faces present in
an image regardless of scale, orientation, pose, expression and so on. Obviously
we are far from this optimal behaviour; however, with the current state
of the art we are able to obtain good results, especially in simplified and well known
scene conditions, as is the case in this work.

Face detection can be accomplished in several ways. The easiest but least effective
way is to use skin colour information [61, 62, 63]. These methods work
well when the background colour of the scene is well separated from the colour of the
subject's skin and when no objects with a colour similar to that of the skin are
present in the scene; otherwise performance decreases drastically. Lighting conditions
can also affect the results. On the other hand, these methods are able to handle different
orientations, poses and sizes of faces without additional effort.

Other methods consider motion as a cue to extract faces from videos, since
faces are usually moving objects [64]. These methods consider several
frames to detect moving entities. However, faces are not the only type of object
able to move in a video stream; therefore these kinds of detectors must also use
other approaches to discriminate among moving entities, otherwise detection could
fail. One way to detect faces and not other moving objects is to look for
the blinking pattern of the eyes, which excludes other types of moving entities [65].

The last class of methods uses facial shape or facial appearance as a cue for face
detection. The input image is scanned at all possible locations and scales by a
sub-window, then a trained face detector is used to classify the pattern inside
the sub-window as face or non-face. In the work of Viola and Jones [66], techniques
like the integral image and training of the face detector with AdaBoost-based methods
further improve the speed and accuracy of detection, becoming the state of
the art for face detection.

4.2.1 Viola and Jones' face detector 

Viola and Jones' face detector classifies images based on the value of simple features,
namely Haar basis functions. In detail, these features fall into three classes:
two-rectangle features, three-rectangle features and four-rectangle features. The value
of a two-rectangle feature is the difference between the sums of the pixel values within
two rectangular regions. The value of a three-rectangle feature is the sum within two
outside rectangles subtracted from the sum in a centre rectangle. Finally, a
four-rectangle feature computes the difference between diagonal
pairs of rectangles (Fig. 4.2). The regions of a feature must have the same size and
shape and be horizontally or vertically adjacent.

Fig. 4.2: Examples of Haar features. From left to right: Two-rectangle features (a),
Three-rectangle features (b), Four-rectangle feature (c).

1 http://opencv.org/

Viola and Jones introduce a new image representation called the integral image, which
allows a fast evaluation of the Haar features during the detection process. The
integral image at location $(x, y)$ contains the sum of the pixels above and to the left
of $(x, y)$, inclusive.

With the integral image computed in advance, it is possible to calculate any
rectangular sum using four reference points of the integral image itself, making
the computation very fast. For each sub-window considered there can be
thousands of features of this kind, which can be seen as mappings from the $N \times N$
space of the sub-window to scalars $z_k(x) \in \mathbb{R}$. These scalars form
an overcomplete feature set that can be used to train the system.
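
The four-point lookup can be sketched as follows. This is a minimal illustration in Python (standard library only) with illustrative function names. It uses a zero-padded integral image, so that entry `ii[y + 1][x + 1]` holds the inclusive sum described above for pixel $(x, y)$; the padding is a convenience assumed here to avoid special cases at the image border.

```python
def integral_image(img):
    """Zero-padded integral image: ii[y][x] sums img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle with top-left (x, y), width w, height h."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def two_rect_feature(ii, x, y, w, h):
    """Horizontal two-rectangle feature: left half minus right half (w even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

Once the integral image is built in a single pass, every rectangle sum costs four lookups regardless of the rectangle size, which is what makes evaluating thousands of features per sub-window affordable.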

The training process is carried out by an AdaBoost learning procedure [67], which
aims to learn the best sequence of weak classifiers $h_m(x)$ and the corresponding
combining weights $\alpha_m$ in order to obtain a strong classifier $H_M$ defined as:

$$H_M(x) = \sum_{m=1}^{M} \alpha_m h_m(x) \tag{4.1}$$

So, the AdaBoost algorithm is used to solve three fundamental problems: (1) selecting
effective features from a large feature set; (2) constructing weak classifiers,
each of which is based on one of the selected features; and (3) boosting the weak
classifiers to construct a strong classifier [68].

Each example in the training set is associated with a weight $w_i$. This set of weights
represents the distribution of the training examples. After each iteration, the training
examples that prove harder to classify are given larger weights $w_i$, in order to
give more importance to these examples at the next iteration. For further details about
the AdaBoost algorithm see [67].
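
The re-weighting idea can be sketched with one round of discrete AdaBoost. This is a minimal illustration in Python (standard library only); it assumes labels and weak-classifier outputs in $\{-1, +1\}$, a weighted error strictly between 0 and 0.5, and the standard choice $\alpha = \frac{1}{2}\ln\frac{1 - \varepsilon}{\varepsilon}$. The variant actually used to train the cascade files may differ, and the function names are illustrative.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One round of discrete AdaBoost for a given weak classifier.

    weights sum to 1; labels and predictions take values in {-1, +1}.
    Returns the combining weight alpha and the re-normalised example weights.
    """
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1.0 - error) / error)   # assumes 0 < error < 0.5
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


def strong_classify(alphas, weak_outputs):
    """Threshold the weighted sum of Eq. 4.1 at zero."""
    score = sum(a * h for a, h in zip(alphas, weak_outputs))
    return 1 if score >= 0 else -1
```

After the update, the misclassified examples carry half of the total weight, so the next weak classifier is pushed to concentrate on them.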

To further improve the accuracy of the face detector, Viola and Jones propose a
trained cascade of strong classifiers instead of a single trained strong classifier,
as a solution for reducing the false alarm rate. The idea is to train a cascade of strong
classifiers in this way: the first strong classifier is trained with all the positive and
negative training examples, then the next strong classifiers are trained using the
non-face examples that pass through the previously trained cascade.

When a sub-window is passed to the cascade in order to determine whether it
contains a face pattern, the cascade nodes are queried in sequence: the features pass
through to the next cascade node only if the previous node classifies them as a face. If
the sub-window is classified as a face by all the cascade nodes, the overall answer of
the face detector is positive.
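
The evaluation of a sub-window by the cascade can be sketched as below. This is a minimal illustration in Python; representing each node as a pair of a scoring function and a threshold is an assumption made for the example, not the format of the cascade files.

```python
def cascade_classify(stages, window):
    """Return True only if every stage accepts the sub-window.

    Each stage is a pair (score_fn, threshold), where score_fn maps a
    sub-window to a real-valued strong-classifier score.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False   # rejected: later stages are never evaluated
    return True
```

Because most sub-windows of an image are rejected by the first nodes, the later and more expensive nodes are evaluated only on a small fraction of candidates.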

The power of this approach is that it can be used not only for face detection, but
also for detecting other kinds of objects. For this purpose the only things needed
are a new training set with examples of the object to detect and a new set of Haar
features that are effective for the detection task. For this reason it is possible,
for example, to create several cascade files, each one specialized in a particular pose
of the face, such as frontal and profile faces. This is the approach of our work: several
cascade files are used in order to detect a face even if it is not frontal, so that
very few measurements are missed. These measurements represent crucial information for
the tracking system in order to improve the overall performance of the face extractor
system.

4.3 The tracking system 

In order to improve the accuracy of the face detector and to remove noise due to
missing measurements (e.g. occlusions) or simply due to face detector failures, a
tracking system can be used.

A tracking algorithm aims to localize a moving object inside a video
stream. It is composed of two main steps: a prediction step and a
correction step.

In the first step the tracking system makes a prediction of where the
object could be. This prediction is made recursively, taking into consideration
the previous predictions, and can be seen as a probability distribution with its
mean and covariance; the mean is usually taken as the location of the
object.

In the second step a correction to the previous prediction is made. This correction
takes into account the quality of the current measurement and updates the current
probability distribution of the states.
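
The two steps can be sketched for a single scalar coordinate. This is a minimal illustration in Python of a Kalman-style recursion with a static (random-walk) motion model; the variances, the use of `None` for a missing detection and the function names are assumptions made for the example, not the tracker of this work.

```python
def predict(mean, var, process_var):
    """Prediction step: the position is kept, the uncertainty grows."""
    return mean, var + process_var


def correct(mean, var, measurement, measurement_var):
    """Correction step: blend prediction and measurement by their variances."""
    gain = var / (var + measurement_var)
    return mean + gain * (measurement - mean), (1.0 - gain) * var


def track(measurements, mean=0.0, var=1.0, process_var=0.01, measurement_var=0.25):
    """Run predict/correct over a sequence; None marks a missing measurement."""
    estimates = []
    for z in measurements:
        mean, var = predict(mean, var, process_var)
        if z is not None:          # skip the correction when the detector fails
            mean, var = correct(mean, var, z, measurement_var)
        estimates.append(mean)
    return estimates
```

When a detection is missing, the estimate is carried forward by the prediction alone, which is how the tracker fills in for sporadic detector failures.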

There are several algorithms for video tracking; however, the most used in computer
vision are based on the Kalman filter [69] or the Particle filter (also known as the
CONDENSATION algorithm [70]).

The Kalman filter and the Particle filter are based on similar ideas and probabilistic
models; however, the Kalman filter is optimal only for linear dynamic
models and assumes that all error terms and measurements have a Gaussian distribution,
whereas the Particle filter generalizes the model in order to also capture non-linear
dynamics and non-Gaussian distributions.

For this reason the Particle filter is more general; however, it is also
computationally more expensive, so a Kalman filter is preferable if the necessary
conditions hold.

In the case of face tracking these conditions hold only if the problem is simplified
with further assumptions. The first assumption is that the errors and the
measurements are Gaussian; the second is that the subject in the video does not make
extreme head movements or occlude the face.

We can reasonably assume that for each frame the position of the subject (state) can
be represented by a Gaussian distribution whose mean indicates its current location.
The errors can also be interpreted as Gaussian noise, so that the first assumption
is satisfied.

The problem lies in the second assumption: linear dynamics of the subject's face can
be assumed only in controlled conditions. In real life a subject can, for example,
sneeze or make rapid and non-linear head movements, creating serious problems for a
Kalman filter based tracker. The accuracy of a Kalman filter is reduced not only by
rapid non-linear movements, but also by occlusions of the face, which are more
common in real life (consider for example the occlusion created by the hands on the
face when the subject is stressed or tired).

This work considers relatively simple videos with only one subject making simple
movements that can be described by linear dynamics models. Occlusions of the face
sometimes happen, but they are sporadic. For this reason a Kalman filter based
tracker can be a good solution for our purpose. However, in view of possible future
work where the two assumptions fail, a Particle filter based tracker was also
considered, and the accuracy of the two trackers was then evaluated.

The next two sections introduce the theoretical fundamentals of the Kalman and
Particle filters and the corresponding tracking systems used in the project. A
further section then shows the results of a test of the accuracy of the two
tracking systems.

4.3.1 Kalman filter based tracker 

The Kalman filter aims to estimate the state $s \in \mathbb{R}^n$ of a discrete time
process governed by the equation:

$$ s_t = A s_{t-1} + B u_{t-1} + w_{t-1} \qquad (4.2) $$

where 

A is the transition matrix of the model and it is applied to the previous state
$s_{t-1}$. This matrix relates the state at time t - 1 to the state at time t and it
is responsible for the state update.

B is the control matrix on the input of the system. It is applied to the control
vector $u_{t-1} \in \mathbb{R}^k$ and maps it to the dimension of the state s.

$w_t \in \mathbb{R}^n$ is the noise on the state. It is assumed to be Gaussian with
zero mean and covariance described by the matrix $Q_t$.

At time t an observation of the real state $s_t$ is made through the measurement
vector $z_t \in \mathbb{R}^m$, which is modelled by:

$$ z_t = H s_t + v_t \qquad (4.3) $$

where

H is the matrix mapping the state space to the measurement space, that is, the
matrix H describes how the measurement is generated from the state.

$v_t \in \mathbb{R}^m$ is the noise of the observation and it is described by a
Gaussian with zero mean and covariance described by the matrix $R_t$.

The Kalman filter is a recursive estimator in which the phases of prediction and
correction occur cyclically. The predict phase projects the state and the covariance
of the state from time t - 1 to time t. Denoting by $\hat{s}_t^-$ and $P_t^-$ the
a priori (predicted) estimates, the equations for this step are:

$$ \hat{s}_t^- = A \hat{s}_{t-1} + B u_{t-1} \qquad (4.4) $$

$$ P_t^- = A P_{t-1} A^T + Q_{t-1} \qquad (4.5) $$

where P is the covariance matrix of the error on the estimate of the state. In the
correction phase the a priori estimate of the state obtained in the predict phase
and the new observation are used to correct the prediction. The equations for this
step are:

$$ K_t = P_t^- H^T \left( H P_t^- H^T + R_t \right)^{-1} \qquad (4.6) $$

$$ \hat{s}_t = \hat{s}_t^- + K_t \left( z_t - H \hat{s}_t^- \right) \qquad (4.7) $$

$$ P_t = (I - K_t H) P_t^- (I - K_t H)^T + K_t R_t K_t^T \qquad (4.8) $$


The entire process can be summarized as: 

1. Project the state ahead (eq. 4.4); 

2. Project the error covariance ahead (eq. 4.5); 

3. Compute the Kalman gain (eq. 4.6); 

4. Update the estimation with measurement z t (eq. 4.7); 

5. Update the error covariance (eq. 4.8). 

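In this work the state is described by the (x, y) position of the face in the video
frame and the size w of the square window containing the face.

Assuming a constant speed for each of the three components of the state and the
absence of control input, the vector s and the matrix A can be written as:

$$ s = \begin{pmatrix} x \\ y \\ w \\ dx \\ dy \\ dw \end{pmatrix}, \qquad
   A = \begin{pmatrix}
   1 & 0 & 0 & 1 & 0 & 0 \\
   0 & 1 & 0 & 0 & 1 & 0 \\
   0 & 0 & 1 & 0 & 0 & 1 \\
   0 & 0 & 0 & 1 & 0 & 0 \\
   0 & 0 & 0 & 0 & 1 & 0 \\
   0 & 0 & 0 & 0 & 0 & 1
   \end{pmatrix} $$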



The measurement system is the same Viola and Jones face detector. This may seem
counterintuitive, and equivalent to detecting the face frame by frame; however, it
is important to remember that in this phase we are modelling a filter: the
measurements are affected by noise and are sometimes missing, so the Kalman filter
allows us to correct them (and to reconstruct the most probable state), improving
performance. Consequently, the measurement vector z and the transformation matrix H
are:

$$ z = \begin{pmatrix} x \\ y \\ w \end{pmatrix}, \qquad
   H = \begin{pmatrix}
   1 & 0 & 0 & 0 & 0 & 0 \\
   0 & 1 & 0 & 0 & 0 & 0 \\
   0 & 0 & 1 & 0 & 0 & 0
   \end{pmatrix} $$
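The predict and correct cycle of eqs. 4.4-4.8 with this constant-speed model can be sketched as follows. This is only an illustrative sketch, not the code used in the project: the noise covariances Q and R, the initial covariance and the choice of skipping the correction step when the detector returns no face (represented here by `None`) are all assumptions made for the example.

```python
import numpy as np

def make_model(q=1e-2, r=1.0):
    """Constant-speed model for the state s = (x, y, w, dx, dy, dw)."""
    A = np.eye(6)
    A[:3, 3:] = np.eye(3)                          # position += speed at each frame
    H = np.hstack([np.eye(3), np.zeros((3, 3))])   # the detector observes x, y, w
    Q = q * np.eye(6)                              # process noise (assumed value)
    R = r * np.eye(3)                              # measurement noise (assumed value)
    return A, H, Q, R

def predict(s, P, A, Q):
    """Eqs. 4.4-4.5 without control input."""
    return A @ s, A @ P @ A.T + Q

def correct(s_pred, P_pred, z, H, R):
    """Eqs. 4.6-4.8."""
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    s = s_pred + K @ (z - H @ s_pred)
    I_KH = np.eye(len(s)) - K @ H
    P = I_KH @ P_pred @ I_KH.T + K @ R @ K.T
    return s, P

def track(measurements, s0, P0=None):
    """Filter a sequence of (x, y, w) detections; None marks a missed detection."""
    A, H, Q, R = make_model()
    s = np.asarray(s0, dtype=float)
    P = np.eye(6) if P0 is None else P0
    estimates = []
    for z in measurements:
        s, P = predict(s, P, A, Q)
        if z is not None:                          # no detection: keep the prediction
            s, P = correct(s, P, np.asarray(z, dtype=float), H, R)
        estimates.append(s.copy())
    return np.array(estimates)
```

For a face moving at constant speed, `track` returns one six-dimensional estimate per frame, including the frames in which the detector failed.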



4.3.2 Particle filter based tracker 

The Particle filter, like the Kalman filter, aims to estimate the state of a moving
object; unlike the Kalman filter, however, it is also suited to tracking non-linear
trajectories or dynamics affected by non-Gaussian noise.

This algorithm also relies on a two-step procedure of prediction and update. These
steps derive from Bayes' rule, which describes the filtering distribution through
the likelihood $p(z_t \mid s_t)$ and the predictive density $p(s_t \mid z_{1:t-1})$,
so that:

$$ p(s_t \mid z_{1:t}) \propto p(z_t \mid s_t)\, p(s_t \mid z_{1:t-1}) \qquad (4.9) $$



Usually the likelihood function is known, whereas the predictive density is not,
because it is given as a hard-to-compute integral:

$$ p(s_t \mid z_{1:t-1}) = \int p(s_t \mid s_{t-1})\, p(s_{t-1} \mid z_{1:t-1})\, ds_{t-1} \qquad (4.10) $$



Analytical solutions are available only in a few simplified cases, for example when
the model is linear and Gaussian (this is the case of the Kalman filter). However, in
general $p(s_{t-1} \mid z_{1:t-1})$ is a complicated function and simulation methods,
like Monte Carlo methods, are required.

The method uses a sample-based approach to estimate the probability distribution of
the state (Monte Carlo method). The posterior probability is represented by a set of
randomly chosen weighted samples, so that:

$$ p(s_t \mid z_{1:t}) \approx \sum_{i=1}^{N} w_t^i\, \delta(s_t - s_t^i)
   \qquad \text{and} \qquad \sum_{i=1}^{N} w_t^i = 1 $$

where $\delta$ is the Dirac delta function and $\{w_t^i, s_t^i\}_{i=1}^{N}$ denotes
a set of weights and particles. It is then possible to approximate the integral in
equation 4.10 with a sum over a discrete number of particles. Expectations such as
$E(s_t \mid z_{1:t})$ can be easily computed as Monte Carlo averages over the
particles:

$$ E(s_t \mid z_{1:t}) \approx \frac{1}{N} \sum_{i=1}^{N} s_t^i \qquad (4.11) $$

In order to instantiate the model we need: an observation equation
$p(z_t \mid s_t)$ that represents the likelihood of the observations given the
state, a state evolution $p(s_t \mid s_{t-1})$ that defines how the state evolves
over time, and an initial state distribution $p(s_0)$.

There are several methods to simulate the sampling of particles; one of these is
Sequential Importance Resampling (SIR). This method is composed of three steps:

1. (Propagation) Draw $s_t^i \sim p(s_t \mid s_{t-1}^i)$ for $i = 1, \dots, N$

2. (Resampling) Draw $s_t^i \sim \mathrm{Mult}_N\left(\{w_{t-1}^i\}_{i=1}^{N}\right)$

3. (Importance normalization)
   $w_t^i = \dfrac{p(z_t \mid s_t^i)}{\sum_{j=1}^{N} p(z_t \mid s_t^j)}$


The first step propagates each particle to a new state through the defined state
evolution function. The second step resamples the particles using the multinomial
distribution defined by the discrete weights at the previous time. The third step
establishes the new normalized importance weights through the likelihood function.
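The three SIR steps can be sketched on a one-dimensional toy problem. The random-walk state evolution and the Gaussian likelihood used here are hypothetical stand-ins chosen only to make the example self-contained; they are not the models used in this project.

```python
import numpy as np

def sir_step(particles, weights, z, propagate, likelihood, rng):
    """One SIR iteration: propagation, resampling, importance normalization."""
    n = len(particles)
    # 1. Propagation: draw each particle from the state evolution.
    particles = propagate(particles, rng)
    # 2. Resampling: multinomial draw driven by the weights of the previous step.
    particles = particles[rng.choice(n, size=n, p=weights)]
    # 3. Importance normalization: likelihood of z for each particle, normalized.
    lik = likelihood(z, particles)
    total = lik.sum()
    weights = lik / total if total > 0 else np.full(n, 1.0 / n)
    return particles, weights

def estimate(particles, weights):
    """Weighted Monte Carlo estimate of the expected state."""
    return np.average(particles, weights=weights)

def random_walk(particles, rng, sigma=1.0):
    """Toy state evolution (assumption): Gaussian random walk."""
    return particles + rng.normal(0.0, sigma, size=particles.shape)

def gaussian_likelihood(z, particles, sigma=2.0):
    """Toy likelihood (assumption): Gaussian centred on each particle."""
    return np.exp(-0.5 * ((z - particles) / sigma) ** 2)

def run_demo(seed=0, n=500, steps=40):
    """Track a state drifting by 0.5 per step from noisy observations."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 5.0, size=n)   # samples from the initial distribution
    weights = np.full(n, 1.0 / n)
    truth = 0.0
    for _ in range(steps):
        truth += 0.5
        z = truth + rng.normal(0.0, 1.0)
        particles, weights = sir_step(particles, weights, z,
                                      random_walk, gaussian_likelihood, rng)
    return truth, estimate(particles, weights)
```

Calling `run_demo()` returns the true final state together with the particle estimate, which should stay close to it.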

Since the face detector proposed by Viola and Jones can only produce binary output
(face or non-face), it is not possible to determine a likelihood of the measurement
system in a probabilistic way in order to fit the particle filter model.

Boccignone et al. [71] propose a methodology to determine a probabilistic model
based on the Viola and Jones face detector. The same approach was used in this work,
and we refer to [71] for further information on the implementation details.
However, to further improve the accuracy of the detector with occlusions and
non-linear motion, the evolution of the particles was not driven by additive
Gaussian noise as the article suggests; a different approach based on optical flow
was used instead.

First the face is detected by the face detector; then the particles are propagated
using the speed and direction of the optical flow of the face window predicted at
time t - 1. In this way, unlike with the Kalman filter, we can give the particles
non-linear dynamics. Furthermore, we are able to track the motion of the face even
when it is partially or completely occluded by hands or other objects: the optical
flow allows the particles to evolve even without a detection of the face, because it
is based only on the assumption of brightness constancy of the pixels over time.

4.3.3 Comparative tests and results 

To estimate the quality of the two trackers presented, three tests were made. The
first test considers a subject moving the head with linear motion; the second
involves a subject moving the head with high-speed non-linear motion; the third
investigates a subject moving the head with linear motion but sporadically occluding
the face with the hands.

Both trackers processed the three videos, and then the number of hits and misses was
computed. By hit we mean a window containing the face with visible eyes, nose and
mouth. The following table shows the results (Tab. 4.1).

            Kalman hits    Particle hits
Test 1          96%             92%
Test 2          93%             97%
Test 3          78%             82%

Tab. 4.1: Results of the comparative tests on the two trackers

In the first test the better performance of the Kalman filter with respect to the
Particle filter is immediately visible. In addition to a higher frame rate, the
Kalman filter is more stable and has more hits than the Particle filter: 96%
against 92%.

In the second test the difficulties of the Kalman filter with non-linear dynamics
become apparent. In this case the Kalman filter has 93% of hits, losing most of the
frames with abrupt changes of direction, while the Particle filter has 97% of hits,
losing only some of the frames with the highest speed.

In the third test the Particle filter proves once again superior to the Kalman
filter. The Kalman filter has 78% of hits, with no hits during the occlusion of the
face, while the Particle filter has 82% of hits, missing only a small portion of the
frames with occlusions. Overall it can be seen that:

1. Kalman filter is optimal for linear dynamics; 

2. Both the trackers show a high percentage of hits; 

3. The task with the highest number of misses is the tracking of a partially or
completely occluded face.

For these reasons we decided to use the Kalman filter, at least for this first part
of the project, because in controlled situations non-linear dynamics and occlusions
are rare. However, in the future the Particle filter and other improvements may
become necessary in order to deal with real-life problems.

4.4 Facial landmarks localization 



The aim of facial landmarks localization is to find the accurate positions of the 
facial feature points such as the corners of eyes and mouth or the centre of nose 
(Fig. 4.3) [68]. 

In the last decades many attempts have been made to fulfil this task. Early research
extracted facial landmarks based only on geometrical knowledge of the face, for
example with contour detection and splines [72]. These model-independent algorithms
led to poor results, and for this reason researchers started to focus on
model-dependent algorithms. The first attempts in this direction used rigid face
models labelled with facial components [73]. Since these face shape models were not
based on statistical learning, their results were once again unsatisfactory.

The first successes were due to the Active Shape Model (ASM) [74] and the Active
Appearance Model (AAM) [75]. In these models, face shape is modelled as a linear
combination of principal modes learned from examples of training face shapes, in
order to learn a deformable shape model through the statistical distribution of
shape and textures (Fig. 4.4).

Fig. 4.3: Example of facial landmarks after the detection of a face

Fig. 4.4: Example of an AAM model for face landmarks localization

With this deformable shape model it is then possible to extract objects with a shape
similar to those in the training set by fitting the deformable model to images.

For their flexibility and good performance, ASM and AAM soon became the most popular
models for facial landmarks detection. However, considerable difficulties are still
encountered on this task, especially when the face images are taken in uncontrolled
situations. For example, it is difficult for an AAM to correctly fit both frontal
and profile (or semi-profile) faces. The reason could be that the model is still too
rigid to generalize over several poses: it concentrates on a global view of the
appearance or shape of the face and not on the subcomponents that compose it.

Deformable Part Models (DPM) address this issue by first finding a match of the
whole object and then, using local appearance models of its parts, fine-tuning the
result by minimizing a deformation cost.

The model can be viewed as an undirected graph whose vertices correspond to the
parts of the object and whose edges connect related parts. For example, in the case
of a face a DPM could be described as a graph in which the vertices are the eyes,
the nose and the mouth, and the edges are between the eyes and the nose and between
the nose and the mouth (Fig. 4.5). The complexity of the algorithm depends on the
structure of the graph; in particular, acyclic graphs allow efficient estimation by
Dynamic Programming. With a DPM it is possible to fuse the local appearance model
and the geometrical constraints into a single model, further improving the quality
of the facial landmarks detector.

The work of Uřičář et al. [76] goes in this direction in order to fulfil the
detection and localization task of facial landmarks. Uřičář treats landmark
detection as an instance of the structured output classification problem, using a
Structured Output SVM (SO-SVM) [77] as the algorithm for learning the parameters of
the landmark detector from the training set. As appearance model he proposes a Local
Binary Patterns (LBP) [78] pyramid; LBP are well known in computer vision for their
ability to represent textures and to easily allow the detection of similar ones.
Finally, as deformation cost a quadratic function $g_{ij}(s_i, s_j)$ of the
displacement vector $s_j - s_i$ was introduced, defined through:

$$ \Psi^{g}_{ij}(s_i, s_j) = (dx, dy, dx^2, dy^2) $$
$$ (dx, dy) = (x_j, y_j) - (x_i, y_i) $$

Fig. 4.5: Underlying graph for the landmark configuration
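As a small illustration, the displacement features can be computed directly from two landmark positions. The sketch assumes that the cost is a weighted sum of these features, with weights learned during training; the weight values used in the usage note are invented for the example.

```python
def deformation_features(s_i, s_j):
    """Features (dx, dy, dx^2, dy^2) of the displacement between two landmarks."""
    dx = s_j[0] - s_i[0]
    dy = s_j[1] - s_i[1]
    return (dx, dy, dx * dx, dy * dy)

def deformation_cost(s_i, s_j, weights):
    """Weighted sum of the features; the weights are hypothetical here."""
    return sum(w * f for w, f in zip(weights, deformation_features(s_i, s_j)))
```

With landmarks at (10, 20) and (13, 16) the features are (3, -4, 9, 16), and the made-up weights (0, 0, -1, -1) give a cost of -25, so larger displacements are penalized more.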

The article also presents a set of experiments evaluating the performance of the
facial landmarks detector. In particular, it shows that this DPM-based detector
performs better than detectors based on AAM. Furthermore, Uřičář released a free
version of his detector written in C++.

For these reasons the facial landmarks detector proposed by Uřičář et al. [76] was
used in this project. However, to further improve its precision, a Kalman filter
(see Section 4.3.1) was added in order to decrease the noise in the detection
process.

4.5 Facial image normalization 

Detecting the faces in the video and tracking their movements is not sufficient to
extract pictures of faces directly usable to train a regressor: the face could
change pose, or different illumination could alter the apparent colour of the skin,
adding noise that may reduce the performance of the regressor. Furthermore, faces
have different vertical and horizontal sizes, and the positions of eyes, mouth and
nose vary among people.

To solve the first problem, that of different poses, the facial landmarks were used
to align the face. It was assumed that only frontal faces are of interest, so every
face detected by the profile face detector was discarded and used only as a
measurement for the face tracker.

Considering a right-handed coordinate system with the origin at the sensor, Z
pointing towards the subject and Y pointing up, a frontal face is able to rotate
around the Z-axis (Roll), around the X-axis (Pitch) and slightly around the Y-axis
(Yaw) (Fig. 4.6). A specific solution was chosen for each of these possible face
movements.

Fig. 4.6: Possible movements of a face around X, Y and Z axis (Pitch, Yaw, Roll)

Normalizing the pose of a face rotated around the Z-axis (Roll) is a fairly simple
task. It is sufficient to consider the positions of the centres of the two eyes in
order to estimate the angle $\rho$ between the line passing through the centres of
the two eyes and the X-axis of the image (Fig. 4.7).

The positions of the centres of the eyes are available thanks to the facial
landmarks detector: each centre can be estimated as the mean of the two corners of
the eye. The angle $\rho$ can then be estimated with a simple equation:

$$ \rho = \operatorname{atan}\left( \frac{e^r_y - e^l_y}{e^r_x - e^l_x} \right) \cdot \frac{180}{\pi} \qquad (4.12) $$

where $(e^r_x, e^r_y)$ are respectively the x and y coordinates of the centre of the
right eye, and $(e^l_x, e^l_y)$ are respectively the x and y coordinates of the
centre of the left eye. Given the angle $\rho$ estimated with equation 4.12, the
face can be rotated by applying an affine transformation to the face image.

Fig. 4.7: The roll angle $\rho$
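The roll estimate and the corresponding rotation can be sketched on landmark coordinates alone. The helper names are illustrative and the sketch assumes the two eye centres are not vertically aligned; warping the actual image would apply the same 2x3 matrix through an image library.

```python
import math

def eye_centre(corner_a, corner_b):
    """Centre of an eye as the mean of its two corners."""
    return ((corner_a[0] + corner_b[0]) / 2.0, (corner_a[1] + corner_b[1]) / 2.0)

def roll_angle_deg(right_eye, left_eye):
    """Roll angle of eq. 4.12, in degrees."""
    dy = right_eye[1] - left_eye[1]
    dx = right_eye[0] - left_eye[0]
    return math.atan(dy / dx) * 180.0 / math.pi

def rotation_about(centre, angle_deg):
    """2x3 affine matrix rotating points by angle_deg around centre."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    cx, cy = centre
    return [[c, -s, cx - c * cx + s * cy],
            [s, c, cy - s * cx - c * cy]]

def apply_affine(m, p):
    """Apply a 2x3 affine matrix to a point (x, y)."""
    return (m[0][0] * p[0] + m[0][1] * p[1] + m[0][2],
            m[1][0] * p[0] + m[1][1] * p[1] + m[1][2])

def level_eyes(right_eye, left_eye):
    """Rotate both eye centres by -rho around their midpoint."""
    rho = roll_angle_deg(right_eye, left_eye)
    m = rotation_about(eye_centre(right_eye, left_eye), -rho)
    return rho, apply_affine(m, right_eye), apply_affine(m, left_eye)
```

After `level_eyes` the two eye centres lie on the same horizontal line, which is the condition the roll normalization aims for.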

The normalization of a face rotated around the X and Y axes is a more difficult
task. It is not only necessary to estimate the rotation angles: we also need a
transformation of a 2D image that simulates a transformation in 3D space, because
with the projection onto the 2D image we have lost a dimension, and with it
information important for the correct reconstruction of the normalized pose.

For this reason we decided to collect information about the current pose of the
face and simply discard faces showing clues of a large rotation angle around the X
and/or Y axis.



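When a face is frontal, the ratio between the lengths of the two eyes is close to
one:

$$ \frac{\lvert e^l_1 - e^l_2 \rvert}{\lvert e^r_1 - e^r_2 \rvert} \approx 1 \qquad (4.13) $$

where $e^l_1, e^l_2$ and $e^r_1, e^r_2$ are the corners of the left and of the right
eye respectively.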



A rotation around the Y axis (Yaw) makes this ratio depart from one. Based on this
observation, frames with an eye ratio outside the range from 1 - r to 1 + r were
discarded. The value of r was set empirically to 0.15.
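The yaw test amounts to a simple check on the ratio of the two eye lengths. In this sketch each eye is given as a pair of corner coordinates, which is an assumption about how the landmarks are stored.

```python
import math

def eye_length(corner_a, corner_b):
    """Distance between the two corners of an eye."""
    return math.hypot(corner_b[0] - corner_a[0], corner_b[1] - corner_a[1])

def eye_ratio(left_corners, right_corners):
    """Ratio between the lengths of the left and of the right eye."""
    return eye_length(*left_corners) / eye_length(*right_corners)

def passes_yaw_test(left_corners, right_corners, r=0.15):
    """Keep the frame only if the ratio lies between 1 - r and 1 + r."""
    return 1.0 - r <= eye_ratio(left_corners, right_corners) <= 1.0 + r
```

For example, eyes of length 20 and 21 pixels pass the test, while eyes of length 20 and 12 pixels (ratio about 1.67) are rejected as a probable yaw rotation.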

To detect and then discard faces with pitch, an eyes and mouth detector was used,
based again on Haar features and a cascade of strong classifiers, trained this time
with pictures of eyes and mouths. Since the classifier was trained with frontal
images with a small degree of rotation, faces with a large pitch angle do not give
the detector sufficient clues to detect the eyes and/or the mouth of the subject.
Pictures that do not pass this test were discarded.

This last step is important not only for discarding faces with a large pitch angle,
but also for discarding pictures that do not contain proper faces with clearly
visible eyes and mouth, which are necessary in order to estimate the affective state
of the subject.

When a face picture passes all these tests, it is necessary to normalize the size of
the window and the relative position of the facial landmarks.

The face window was fixed to 90x110 pixels, and the image was scaled so that the
eyes baseline of the subject is 70 pixels long and the vertical distance between
eyes and nose is 31 pixels (Fig. 4.8). Then the window was cropped around the face.
In this way we obtain pictures of faces with normalized proportions.
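The two target distances translate into scale factors for the image. The sketch assumes that the roll has already been corrected and that the horizontal and vertical factors are computed independently, which is one possible reading of the procedure above.

```python
def normalization_scales(right_eye, left_eye, nose, baseline=70.0, eye_nose=31.0):
    """Return (sx, sy) mapping the eyes baseline and the eye-nose distance
    of an upright face to the target lengths in pixels."""
    eye_dist = abs(left_eye[0] - right_eye[0])
    eyes_y = (left_eye[1] + right_eye[1]) / 2.0
    sx = baseline / eye_dist
    sy = eye_nose / abs(nose[1] - eyes_y)
    return sx, sy
```

For eye centres 35 pixels apart and a nose 31 pixels below them, the function returns a horizontal factor of 2.0 and a vertical factor of 1.0.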

The next step is illumination normalization. To fulfil this task we used a
particularly robust algorithm proposed by Tan and Triggs as a preprocessing step for
a face recognition application [79].

The first step is to apply a gamma correction to the gray-level image $I$. Gamma
correction is a non-linear transformation that replaces $I$ with $I^{\gamma}$, with
$\gamma > 0$. Its effect is to enhance the local dynamic range of the image in dark
or shadowed regions and to compress it in brighter regions. The authors suggest a
gamma value of $\gamma = 0.2$.

The second step uses a difference of Gaussians (DoG) filter to remove shading
effects; gamma correction alone is not sufficient to remove the shading on the face
texture. A DoG implements a band-pass filter that removes low-frequency information,
such as shading, and high-frequency details, such as image noise. The problem is to
determine how wide the pass band should be. The authors suggest $\sigma_0 = 1.0$ and
$\sigma_1 = 2.0$ as the standard deviations of the two Gaussians.

Fig. 4.8: Normalized sizes of the face

The final step globally rescales the image intensities to standardize a robust
measure of overall contrast. To accomplish this task it is necessary to take into
account that the image could contain small portions of extreme values due to
highlights, garbage at the image borders and small dark regions such as nostrils, so
it is crucial to use an estimator robust to this kind of noise. If needed, a mask
can also be used to exclude useless portions of the image. The authors suggest a
two-stage process:

$$ I(x,y) \leftarrow \frac{I(x,y)}{\left( \operatorname{avg}\left( |I(x,y)|^{\alpha} \right) \right)^{1/\alpha}} \qquad (4.14) $$

$$ I(x,y) \leftarrow \frac{I(x,y)}{\left( \operatorname{avg}\left( \min(\tau, |I(x,y)|)^{\alpha} \right) \right)^{1/\alpha}} \qquad (4.15) $$

where $\alpha$ is a strongly compressive exponent that reduces the influence of
large values and $\tau$ is a threshold used to truncate large values after the first
stage of the normalization. The suggested values are $\alpha = 0.1$ and $\tau = 10$.

To further remove extreme values that remain after the normalization process, a
hyperbolic tangent was used to compress values to the range $(-\tau, \tau)$:

$$ I(x,y) \leftarrow \tau \tanh\left( \frac{I(x,y)}{\tau} \right) \qquad (4.16) $$

Finally, the pixel values of the image are scaled to the range (0, 1).
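The whole illumination normalization chain can be sketched with NumPy as follows. The Gaussian blur is written out explicitly to keep the example self-contained, and the final min-max rescaling is an assumption, since the exact mapping to (0, 1) is not specified here; the sketch also assumes a non-constant gray-level image with non-negative values.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized 1D Gaussian kernel truncated at three standard deviations."""
    radius = int(np.ceil(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge replication at the borders."""
    k = gaussian_kernel(sigma)
    padded = np.pad(img, len(k) // 2, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def illumination_normalize(img, gamma=0.2, sigma0=1.0, sigma1=2.0,
                           alpha=0.1, tau=10.0):
    """Gamma correction, DoG filtering and contrast equalization (eqs. 4.14-4.16)."""
    I = np.power(np.asarray(img, dtype=float), gamma)            # gamma correction
    I = gaussian_blur(I, sigma0) - gaussian_blur(I, sigma1)      # DoG band-pass
    I = I / np.mean(np.abs(I) ** alpha) ** (1.0 / alpha)         # eq. 4.14
    I = I / np.mean(np.minimum(tau, np.abs(I)) ** alpha) ** (1.0 / alpha)  # eq. 4.15
    I = tau * np.tanh(I / tau)                                   # eq. 4.16
    return (I - I.min()) / (I.max() - I.min())                   # assumed rescaling
```

Applied to a 90x110 face window (an array of shape (110, 90)), the function returns an array of the same shape with values between 0 and 1.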



Chapter 5 

Preliminary tests and results 



Science is nothing but perception 



Plato 



5.1 Introduction to tests 

The aim of this chapter is to investigate, with some preliminary tests, the validity
of the choices made in the previous chapters. Our approach is novel in that it
reverses the view adopted by most current research on emotion recognition. Whereas
the majority of current work assumes that an observation with a specific affective
power generates an affective latent variable, we investigate the opposite
direction: it is the affective latent variable that possesses a specific affective
power and consequently generates a congruous observation.

For this reason, the current datasets for emotion recognition contain labelled
observations, where the labels can be seen as the representation of the affective
power of the observations. Although with these datasets it is possible to regress a
latent space with several supervised machine learning techniques, the topology of
the resulting space is biased by the representation used to describe the emotional
power of the observations in the dataset. Moreover, as noted several times in the
previous chapters, there is currently no theory of the representation of emotions
shared among psychologists, so this bias could simply be wrong.

Our idea makes it possible to solve this issue, since the method used for the
regression of the latent space is not supervised, but based on an unsupervised
probabilistic dimensionality reduction. The downside of this approach is that
objective evaluations of the performance are harder to make: the topology of the
latent space is described by a set of axes whose meaning is not always clear,
whereas the observations used for the training process are usually described by a
set of well defined labels, so that each attempt to compare them becomes a difficult
task.

There are at least two ways to make objective evaluations of the performance with
our approach. The first involves several subjects evaluating, with a Likert scale or
similar approaches, the similarity between the observations in the dataset and the
observations generated by the latent space. The second requires a training set and a
test set where each observation is labelled with specific characteristics of the
facial expression, so as to allow one-to-one relationships between the two sets;
these could be used to estimate the distance in the latent space between the
position of an observation in the training set and that of its counterpart in the
test set.

The advantage of the first method is that the evaluation is left to human subjects
representing the future users of the application, giving first important information
on the ability of the model to recognize emotions as humans do. The disadvantage is
that, in order to produce statistically significant evaluations, many subjects are
needed, differing in age, sex, culture, education and so on, which is not always
simple to arrange, at least not in a short period of time like that available to
develop this preliminary work.

The major advantage of the second methodology is its ability to produce numerical,
sound and objective results; however, it is not easy to find datasets with a
one-to-one relationship between the two sets of observations. For example, this is
not possible when the labels cover several shades of the concept that they describe
(e.g. the classes of the basic emotions). Furthermore, even when the dataset has
this characteristic, misalignments between the two facial expressions under
evaluation are still possible, because they evolve differently over time.

This thesis focuses on a preliminary observation of the classification behaviour of
the proposed model based on GP-LVM. For this reason the following tests have the
sole purpose of giving us high-level information to determine whether or not the
proposed approach is worth pursuing in future work, providing some useful guidelines
for its potential improvement. Since the results we achieve are consequently not
sufficient for a sound scientific evaluation, we will devote future work to
improving their quality.

For the following tests and their evaluation we used the MATLAB® code¹ created by
Lawrence, which provides several useful functions for GP-LVM. This algorithm
requires specifying the number of active points, namely the number of points
selected by the IVM algorithm, the maximum number of optimization cycles, and the
desired dimensionality of the latent space.

5.1.1 Datasets and issues concerning data collection 

A major difficulty for affective computing applications is collecting good quality
data to use for the training process and then for evaluating performance. This
difficulty has several causes, most of which were already discussed in previous
sections. For greater clarity they are listed below as a brief summary:

The concept of emotion under investigation - There are several concepts compatible
with the term emotion. For example we can mean the current internal affective state
of the user, the long-term dynamics of affective states, or simply the external
feedback given by the subject at a precise moment. In the first two cases it is
difficult to produce good quality data, because the only way to try to measure the
internal affective state of a user is to place on the user annoying and invasive
sensors, which increase the risk of biased data, and then to infer the affective
state from them [80]. Furthermore, current sensors are not very precise, so the data
are very noisy. In the latter case data collection is simpler, but still not easy.
We can use non-invasive sensors like a camera and a microphone; however, it is still
possible to collect noisy data due to the insufficient quality of the sensors or of
the experimental conditions, especially when data are collected in uncontrolled
situations.

1 http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/fgplvm/

Acted vs. non-acted emotions - It is possible to choose between acted and non-acted
emotions. In the first case a subject is asked to produce a particular facial
expression and/or speech in order to emphasize a particular emotion. In the second
case the external signals given by the user are more natural, because they are
induced with particular techniques or are filmed in secret during a particular
social interaction. The former are simpler to collect but also poor in information,
because the expressions are often unnatural and/or not very prominent. Furthermore,
acted data usually do not include the full range of emotions found in a common
social interaction. The latter are more informative, being collected from real life,
but they suffer from severe noise, due for example to different face poses,
non-uniform light conditions and lip movements during speech that could produce fake
emotional data.

Labelling process - The labelling process is not a simple task; determining a label
for a particular affective state is usually subjective. More objective methods of
labelling imply rigid classes and pattern schemas, leading to datasets that are
often unsuitable for most real-life applications. A good labelling approach should
provide an objective labelling procedure while maintaining a wide range of possible
schemas.

Taking these difficulties into account, the data were initially collected from a set
of videos available on Youtube². This collection includes six subjects, three males
and three females, filmed during a real interview or during an acted³ monologue and
covering a wide range of affective states and facial expressions. With this choice
it was possible to evaluate performance directly on real-life data; however, labels
were not present, which created difficulties during the evaluation process.

Considering the problems of the previous dataset, the use of a laboratory dataset
was also contemplated. For several reasons our choice fell on the MMI Facial
Expression Database collected by Valstar and Pantic [81]. The strength of this
dataset is that almost every frame of its videos is labelled with the Action Units
present on the face of the subject and, even if the facial expressions are acted, a
wide range of facial expressions is covered (although only single Action Units or
simple combinations of them were considered for each video).



2 http://www.youtube.com

3 Acted but not constrained, so the facial expressions appear to be highly
naturalistic and spontaneous



CHAPTER 5. PRELIMINARY TESTS AND RESULTS 








Tab. 5.1: Details of the web-video dataset



5.2 Dataset of heterogeneous subjects from Web 








The Web contains a lot of information; unfortunately most of it is unstructured
or semi-structured. Nevertheless, the quantity of data available on the Web is often greater
than what can be collected in a laboratory.

Our idea involves collecting several videos from Youtube and then extracting the
facial expressions they contain to train the model. This decision stems from the
conviction that the main shortcomings of current facial expression databases are the
unrealistic facial expressions they contain, the absence of important social signals,
and often the aseptic character of the included expressions. Accordingly, collecting
a set of videos of real interviews or highly spontaneous monologues seemed to us a
good way to emphasize the emotional character of the faces.

The dataset includes six subjects: three males and three females (Fig. 5.1). Each video
includes several facial expressions; to cover the whole range of emotions we tried to
collect faces exhibiting all six basic emotions and their shadings. Unfortunately, footage
containing emotions such as disgust and fear was difficult to find, and for this reason
these emotions are present only in a small portion of the frames. Tab. 5.1 summarizes
the emotions expressed by the subjects; obviously the classification of the frames into
these six basic classes of emotions is reductive, but it was necessary as a guideline in
order to create a rather balanced dataset.

The major difficulty with this kind of dataset is that the environmental conditions
were not controlled, so both the lighting and the pose of the face can change during
the shot. Furthermore, facial occlusions are sporadically present.

Fig. 5.1: Subjects of the web-video dataset (panels (a) to (f): Subjects A to F)

Fig. 5.3: Examples of unnatural faces generated from points of the latent space

Another problem is that the footage is long, which produces a large number of frames
for each subject. This causes on the one hand a dimensionality problem (the memory
necessary to process the whole data) coupled with a lot of redundancy, and on the
other hand a problem for the regression with GP-LVM, since this model is known
to produce good regressions with small training sets [82]. To solve this
problem, for each subject the 150 most informative frames were sampled using
the IVM algorithm presented in Section 3.5. After this sampling process our
dataset included 900 frames in total.
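The frame subsampling described above can be illustrated with a minimal sketch. It is not the IVM implementation of Section 3.5: it assumes a Gaussian noise model and a unit-variance RBF kernel, in which case (to our understanding) the IVM selection criterion reduces to greedily picking the point with the largest GP posterior variance. The names `rbf_kernel` and `select_informative_frames`, and the values of `noise` and `lengthscale`, are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)


def select_informative_frames(frames, n_active, noise=0.1, lengthscale=1.0):
    """Greedily pick n_active rows of `frames` with the largest posterior variance.

    Assumption: Gaussian noise, so the most informative point is the one
    whose current GP posterior variance is largest.
    """
    n = frames.shape[0]
    K = rbf_kernel(frames, frames, lengthscale)
    var = np.diag(K).copy()           # current posterior variance of every frame
    M = np.zeros((n_active, n))       # low-rank factor: posterior cov = K - M.T @ M
    available = np.ones(n, dtype=bool)
    chosen = []
    for t in range(n_active):
        i = int(np.argmax(np.where(available, var, -np.inf)))
        chosen.append(i)
        available[i] = False
        # Column i of the current posterior covariance.
        s = K[:, i] - M[:t].T @ M[:t, i]
        M[t] = s / np.sqrt(var[i] + noise)
        var = var - M[t] ** 2         # rank-one variance update after adding frame i
    return chosen
```

Calling `select_informative_frames(frames, 150)` on the flattened frames of one subject would return the indices of 150 frames; near-duplicate frames are unlikely to be picked together, because selecting one of them sharply lowers the posterior variance of the others.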



5.2.1 Pilot test 

The aim of this first test was to verify the properties of the latent space regressed
with GP-LVM after a few cycles and with a small number of active points,
in order to become familiar with this dimensionality reduction tool.

The maximum number of cycles and the number of active points were both set to 100,
and the dimensionality of the latent space to 2. The resulting latent space is shown in
Fig. 5.2, where the identity of each subject in the training set is drawn in a different
colour and/or shape.

The resulting latent space shows rather well-separated identities of the subjects,
although this was not our aim. Our need is in fact to separate different facial
expressions and, since similar facial expressions are usually produced by several
subjects in the dataset, this implies that the identities should be more mixed up in
the latent space. Furthermore, the points of the latent space generate noisy
observations in the data space, resulting in unnatural faces and non-smooth
dynamics (Fig. 5.3).

To solve the first problem, a possible solution is to increase the number of active
points used during the optimization process, in order both to have more sparse facial
expressions and to increase the chances of generating clusters of them; to
solve the second problem, it is convenient to greatly increase the number of
cycles.

Fig. 5.2: The latent space generated after 100 cycles and using 100 active points

Fig. 5.5: Examples of classification of new observations. The first row reports the new
observed data, whereas the second row contains the data generated by
the model according to the likelihood of the new observation w.r.t. the latent points.



5.2.2 Varying the number of cycles and active points 

In this test the number of cycles was increased to 1000 and the number of active
points to 300. The resulting latent space is shown in Fig. 5.4.

The identities are grouped into even more clearly separated clusters, which makes it
difficult to regress similar facial expressions across the subjects of the dataset.
This means that increasing the number of active points is not the right answer to
this problem.

On the other hand, the noise on the observations generated by the latent points was
drastically reduced, yielding smooth dynamics among close points of the space and
more realistic faces.

To verify the ability of the model to classify new observations, we sampled 10 facial
expressions from another video available on the Web
containing facial expressions similar to those used for the training process, but with
different subjects. Each new observation was classified by the model according to
the likelihood of the new data w.r.t. the latent variables. The results are shown in
Fig. 5.5.
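The likelihood-based classification described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure actually used in the tests: the candidate latent positions are restricted to the training latent points, the kernel is a unit-variance RBF, and the noise is isotropic Gaussian. The function names and the parameters `noise` and `lengthscale` are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, lengthscale=1.0):
    """Unit-variance RBF kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale ** 2)


def classify_by_likelihood(y_new, X_latent, Y_train, noise=1e-2, lengthscale=1.0):
    """Return the index of the latent point under which y_new is most likely,
    together with the observation generated by the model at that point."""
    K = rbf_kernel(X_latent, X_latent, lengthscale)
    Kn = K + noise * np.eye(len(X_latent))
    # GP predictive mean at every training latent point: shape (N, D).
    mean = K @ np.linalg.solve(Kn, Y_train)
    # GP predictive variance (plus noise) at every training latent point: shape (N,).
    var = 1.0 - np.sum(K * np.linalg.solve(Kn, K), axis=0) + noise
    d = Y_train.shape[1]
    sq_err = np.sum((y_new[None, :] - mean) ** 2, axis=1)
    # Isotropic Gaussian log-likelihood of y_new under each latent point.
    log_lik = -0.5 * (sq_err / var + d * np.log(2.0 * np.pi * var))
    best = int(np.argmax(log_lik))
    return best, mean[best]
```

The second returned value plays the role of the generated face shown in the second row of Fig. 5.5.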

A qualitative inspection shows that the classification process fails and that it
suffers from a bias towards the pose of the face and especially the face shape. These
results confirm that the identity, and consequently the size and shape of the face,
drastically reduces the ability of the model to classify new observations.




Fig. 5.4: The latent space generated after 1000 cycles and using 300 active points

Fig. 5.6: Example of application of the mask on a subject: (a) the frame before
applying the mask, (b) the mask and (c) the frame after the application of the
mask.



5.2.3 The identity problem 

To solve the problem that emerged in the previous test (Section 5.2.2), the parts of
the face that are useless for emotion recognition, and that could produce a bias
towards the identity of the subjects, must be hidden. A possible approach is to use as
features for the classification only the positions of facial landmarks and/or splines
describing the shape of the mouth and eyes. However, in our opinion, this approach is
not sufficient to capture the complex structure of a non-trivial facial expression,
since information such as wrinkles would be lost.

A second possibility is to cover with a mask the pixels of the face that are useless for
emotion recognition, trying to hide the identity of the subject as much as possible
without occluding important cues of the facial expression. To this end the
mask in Fig. 5.6 was proposed, and a new model was trained using only the pixels
not covered by the mask. In this case too the number of cycles was set to 1000 and the
number of active points to 300, producing the latent space in Fig. 5.7 (a).
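Training on the unmasked pixels only amounts to selecting, for every frame, the pixels where the mask is inactive and flattening them into a feature vector. A minimal sketch, assuming grey-level frames stored as a NumPy array and a boolean mask in which `True` marks a pixel to keep (both conventions, and the name `masked_features`, are assumptions made for illustration):

```python
import numpy as np


def masked_features(frames, keep_mask):
    """Flatten each frame into a vector containing only the pixels to keep.

    frames:    array of shape (n_frames, height, width)
    keep_mask: boolean array of shape (height, width), True = pixel is kept
    returns:   array of shape (n_frames, number of kept pixels)
    """
    frames = np.asarray(frames, dtype=float)
    keep_mask = np.asarray(keep_mask, dtype=bool)
    if frames.shape[1:] != keep_mask.shape:
        raise ValueError("mask and frames must have the same height and width")
    # Boolean indexing over the two spatial axes keeps the frame axis intact.
    return frames[:, keep_mask]
```

The resulting matrix can be used as the observation matrix of the model in place of the full frames, which also reduces the dimensionality of the data space.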





Fig. 5.7: The latent space generated after 1000 cycles with masked frames (a) and
the corresponding topology according to the core affect proposed by Russell (b, with a
Valence axis)



In this new latent space the identities are more mixed up. Inspecting the
transitions among close points in the latent space reveals smoother dynamics in the
facial expressions, even though the identity changes while moving along the
selected path.
Fig. 5.8: Samples from the four quadrants of the generated core affect: (a) negative
valence and passive arousal, (b) positive valence and passive arousal, (c) negative
valence and active arousal, (d) positive valence and active arousal



Furthermore, the latent space shows an interesting topology. There appears to be
a correspondence between the core affect space theorized by Russell (cf. Section 2.2.6)
and the space generated by GP-LVM, since the four quadrants sketched
in Fig. 5.7 (b) are visible, like those in Fig. 3.1.

Fig. 5.8 shows 5 observations for each area of the regressed core affect space. However,
the subdivision into four areas is not always clear-cut, as shown by the
non-linear boundaries along the arousal axis. Conversely, the valence is more precisely
divided.

To test the classification performance on this new space we used the same observations
as in the previous test (cf. Section 5.2.2). The results are shown in Fig. 5.9.

Although still not satisfactory, the current results are more accurate than those of the
previous test. The remaining problems are largely due to the quality of the dataset. First
of all, the number of facial expressions present in the dataset is high, but not sufficient
to cover the whole range of emotions. This in turn is due to the difficulty of
collecting videos of single subjects exhibiting a series of facial expressions without
large changes of pose and with good footage quality. The second problem
is related to the repetition of facial expressions in the dataset: the different
identities cause these similar facial expressions to be separated in the latent space,
even though they should be placed in nearby areas.



Fig. 5.9: Examples of classification of new observations. The first row reports the
new observed data, whereas the second row contains those generated by the model
according to the likelihood of the new observation w.r.t. the latent points.
Fig. 5.10: The subjects of MMIDB used for the tests: (a) Subject 0, (b) Subject 1



5.3 MMI Facial Expression Database 

The MMI Facial Expression Database [81] contains more than 1500 samples of both
static images and image sequences of frontal and profile faces exhibiting various facial
expressions, single Action Unit activations, and multiple Action Unit activations. An
Action Unit (AU) describes the activation of a single facial muscle as defined in the
Facial Action Coding System (FACS) [54]. FACS is a system designed to
give an objective description of changes in facial expression in terms of observable
facial muscle actions (Fig. 5.11). This system provides rules for the visual detection
of 44 different AUs and their temporal segments (onset, apex, offset).

The database includes 19 subjects of different gender, race and age, and with
different facial characteristics such as beard, glasses and moustache. Each subject
produced a series of recordings, each containing either a single AU or a combination
of a minimal number of AUs (when, for instance, a single AU cannot be displayed alone).

The database includes both frontal and profile faces; the subjects were asked to
display the required expressions while minimizing out-of-plane head motions.

Most of the frames in the database are described in terms of the displayed AUs;
moreover, access to the database is web-based, with the possibility of filtering the
data with a specific query 4 .

For our purpose two subjects were used: the first (Subject 0) for the training
process and the second (Subject 1) for the test process (Fig. 5.10). To reduce the
redundancy and dimensionality of the frames of Subject 0, we subsampled each recording,
selecting only the frames with the AU at its apex. Since
these frames represent a small fraction of the total, the training set size decreased
drastically, to approximately 900 frames (Fig. 5.12).
Fig. 5.11: Examples of Action Units

4 http://www.mmifacedb.com/

Fig. 5.12: Examples of facial expressions contained in the dataset selected for the
training process



5.3.1 Pilot test 

As for the previous dataset, the aim of this first test was to verify the regression
ability of GP-LVM on the dataset collected from the MMI database. Since the
current and the previous datasets have roughly the same amount of training data,
the number of cycles was set to 1000, the number of active points to 300 and the
dimensionality of the latent space to 2. The resulting space is shown in Fig. 5.13,
where the different combinations of AUs are drawn with different colours and/or
symbols.

From Fig. 5.13 it is evident that some AU combinations are well separated in the
space, whereas others are mixed up in a single area. Due to the large number of
classes (52), the specific AUs not well separated by GP-LVM cannot be visually
distinguished. However, a deeper investigation shows that the classes that are well
grouped into clusters include the more prominent facial expressions (Fig. 5.14).
Unfortunately, most of the AUs in the dataset are similar to each other, as the changes
of the facial expressions w.r.t. a neutral one are very small. Consequently it is
difficult for GP-LVM to assign them to distinct classes in the latent space.

Unlike the latent space generated from the web dataset faces, the topology of 
this latent space gives no information on arousal, valence or other affective variables. 




Fig. 5.13: The latent space generated after 1000 cycles

Fig. 5.14: Examples of faces belonging to classes well separated in the latent space
generated by GP-LVM

Concerning the classification ability, unlike in the previous tests, we used here
most of the AUs performed by Subject 1 and also some of the most characteristic
frames of the subjects appearing in the web dataset. The results are shown in
Fig. 5.16 and Fig. 5.17 respectively.

A qualitative analysis of the results shows a greater accuracy on the web dataset
faces: the more prominent character of these facial expressions allows them to fall
into well-divided classes in the regressed latent space. On the contrary, the MMI
dataset does not present sufficiently strong facial expressions: it displays movements
of only the facial muscle under examination and consequently shows little, or even no,
emotional power.

Fig. 5.15: Positions of true and test faces in the 2D latent space



For some of the faces of Subject 1 it was possible to compute the distance relative
to the position of the face of Subject 0 activating the same set of AUs. Unfortunately,
the presence of unlabelled faces in the dataset and discrepancies between
the AUs activated by Subject 0 and those activated by Subject 1 prevented us from
accomplishing this task for all the faces.

The results of the classification in the latent space are shown in Fig. 5.15 where 
each combination of AUs is represented with a different colour and/or shape. 

In the picture some sets of AUs were classified close to their true position
(AUs 6 12 25, AUs 16 25, AUs 17, AUs 18, AUs 5), whereas others were placed
far away in the space (AUs 10 25, AUs 6 13, AUs 1 2, AUs 22 25, AUs 17 24,
AUs 16 25, AUs 25 27, AUs 17 26, AUs 30).

Real         10 25    6 12 25  6 13     16 25    17       18       1 2      22 25    17 24    16 25    25 27    17 26    30       5
AUs 10 25    0.79     0.83     0.81     0.28     0.26     0.32     --       0.63     0.80     0.81     0.79     0.77     0.80     0.34
AUs 6 12 25  0.29     0.18     0.32     0.51     0.53     0.39     0.65     0.69     0.30     0.32     0.29     0.27     0.30     0.35
AUs 6 13     0.57     0.64     0.59     0.08     0.08     0.10     0.23     0.46     0.58     0.59     0.57     0.56     0.58     0.13
AUs 16 25    0.60     0.66     0.62     0.09     0.08     0.12     0.20     0.47     0.60     0.62     0.60     0.58     0.60     0.15
AUs 17       0.54     0.62     0.56     0.07     0.07     0.07     0.26     0.43     0.55     0.56     0.54     0.53     0.55     0.11
AUs 18       0.42     0.52     0.44     0.15     0.17     0.09     0.39     0.36     0.42     0.43     0.42     0.40     0.42     0.09
AUs 1 2      0.53     0.74     0.52     0.43     0.44     0.47     0.69     0.07     0.53     0.51     0.52     0.52     0.52     0.48
AUs 22 25    0.52     0.60     0.54     0.06     0.08     0.06     0.29     0.40     0.52     0.53     0.52     0.50     0.52     0.10
AUs 17 24    0.61     0.70     0.62     0.04     0.02     0.15     0.24     0.40     0.61     0.62     0.60     0.59     0.61     0.19
AUs 16 25    0.39     0.46     0.42     0.20     0.22     0.09     0.40     0.45     0.40     0.41     0.39     0.38     0.40     0.06
AUs 25 27    0.49     0.56     0.50     0.10     0.12     0.03     0.32     0.41     0.49     0.50     0.48     0.47     0.49     0.06
AUs 17 26    0.53     0.63     0.55     0.03     0.06     0.10     0.30     0.37     0.54     0.54     0.53     0.52     0.53     0.13
AUs 30       0.44     0.53     0.46     0.13     0.15     0.06     0.37     0.39     0.44     0.45     0.44     0.42     0.44     0.06
AUs 5        0.51     0.57     0.53     0.12     0.13     0.04     0.28     0.46     0.52     0.53     0.51     0.49     0.52     0.06

Tab. 5.2: Normalized distances among the training and test points in 2D latent
space

In order to obtain a measure of the classification accuracy, we used the Euclidean 
distance between the predicted position and the true one. The distances were nor- 
malized to the maximum distance among all the points of the training set in the 
latent space. The results are shown in Tab. 5.2 where the minimum distances are 
underlined and the distances between predicted and true position of the same AUs 
are in bold. The numerical results confirm the qualitative analysis previously made. 
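The distance measure described above can be written compactly. The sketch below assumes that the predicted, true and training positions are available as arrays of latent coordinates; the function name and argument names are hypothetical.

```python
import numpy as np


def normalized_distances(predicted, true, training_latent):
    """Euclidean distance between every predicted and every true latent position,
    divided by the largest pairwise distance among the training latent points."""
    predicted = np.asarray(predicted, dtype=float)
    true = np.asarray(true, dtype=float)
    training_latent = np.asarray(training_latent, dtype=float)
    # Largest distance between any two training points (normalization constant).
    diff_train = training_latent[:, None, :] - training_latent[None, :, :]
    max_dist = np.sqrt((diff_train ** 2).sum(axis=-1)).max()
    # Entry (i, j): distance between predicted point i and true point j.
    diff = predicted[:, None, :] - true[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)) / max_dist
```

With one predicted and one true position per AU combination, the diagonal of the returned matrix holds the distances between the predicted and true positions of the same AUs, and the row minimum indicates the closest true configuration.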

To gain further information on the classification accuracy of the model, we
report in Tab. 5.3 the following descriptive statistics: the mean, median,
standard deviation, quantile 0.1 and quantile 0.9.



Mean                 0.3888
Median               0.4273
Standard deviation   0.2489
Quantile 0.1         0.0734
Quantile 0.9         0.7038

Tab. 5.3: Results of the classification test in 2D latent space

From the above analysis it is quite evident that these results are not satisfactory;
however, the tests are preliminary and there is room for future improvement.
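The descriptive statistics of Tab. 5.3 can be computed as in the following sketch. It assumes, as the rounded table entries suggest but the text does not state explicitly, that the statistics are taken over the distances between predicted and true positions of the same AUs, i.e. the diagonal of Tab. 5.2. The sample standard deviation (`ddof=1`) and the default NumPy quantile interpolation are also assumptions; quantiles in particular depend on the interpolation convention and may differ from the tabulated ones.

```python
import numpy as np

# Same-AU distances: the diagonal of Tab. 5.2, as transcribed (rounded to two decimals).
same_au = np.array([0.79, 0.18, 0.59, 0.09, 0.07, 0.09, 0.69,
                    0.40, 0.61, 0.41, 0.48, 0.52, 0.44, 0.06])


def describe(values):
    """Mean, median, sample standard deviation and the 0.1 / 0.9 quantiles."""
    values = np.asarray(values, dtype=float)
    return {
        "mean": values.mean(),
        "median": np.median(values),
        "std": values.std(ddof=1),
        "q10": np.quantile(values, 0.1),
        "q90": np.quantile(values, 0.9),
    }


stats = describe(same_au)
```

On the rounded entries, the mean, median and standard deviation obtained in this way lie within about 0.01 of the values reported in Tab. 5.3.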



5.3.2 Augmenting the dimensionality of latent space 

The problem presented in the previous section could probably be solved by enhancing
the classification ability of GP-LVM, so that the classes are better separated from
each other in the latent space. Unfortunately, increasing the number of
cycles does not enhance the classification ability of the model, as it only guarantees
a reduction of the noise on the observations generated from the latent space.

A possible solution could be to increase the dimensionality of the latent space, in
order to allow the model to gain classification accuracy thanks to the information
carried by the new dimensions.

We first tried to regress a three-dimensional space iterating the process for 1000 
cycles and using 300 active points; however the observations generated by the latent 
points were affected by noise. 
Fig. 5.16: Classification results of Subject 1 facial expressions

Fig. 5.17: Classification results of faces from the web dataset

Fig. 5.18: (a) The latent space generated after 3000 cycles in a 3D
space; (b) the positions of true and test faces in the 3D latent space.

To check whether the chosen maximum number of cycles was insufficient to
achieve satisfactory results, we increased it to 3000 while keeping the same number
of active points. The resulting space after 3000 cycles is shown in Fig. 5.18 (a).
Here again we observe the same problem affecting the 2D latent space, namely the
intrinsic difficulty of separating classes whose emotional power is not prominent.

For the classification test we used the same test faces as in the previous test. The
results shown in Fig. 5.19 and Fig. 5.20 are not encouraging, being sometimes even
worse than in the 2D space. To investigate the accuracy of the model in more depth, we
show the positions of the predicted and true observations in Fig. 5.18 (b) and the
corresponding normalized distances in Tab. 5.4.

Real         10 25    6 12 25  6 13     16 25    17       18       1 2      22 25    17 24    16 25    25 27    17 26    30       5
AUs 10 25    0.40     0.37     0.48     0.19     0.14     0.24     0.15     0.24     0.40     0.47     0.44     0.24     0.15     0.40
AUs 6 12 25  0.49     0.20     0.39     0.34     0.53     0.25     0.32     0.25     0.50     0.39     0.37     0.25     0.37     0.36
AUs 6 13     0.44     0.37     0.51     0.21     0.48     0.17     0.20     0.17     0.44     0.51     0.48     0.17     0.49     0.46
AUs 16 25    0.32     0.41     0.48     0.12     0.36     0.10     0.12     0.10     0.32     0.48     0.45     0.10     0.45     0.41
AUs 17       0.33     0.43     0.50     0.20     0.37     0.08     0.21     0.07     0.33     0.49     0.47     0.07     0.47     0.44
AUs 18       0.29     0.47     0.46     0.32     0.33     0.15     0.34     0.15     0.30     0.45     0.44     0.15     0.44     0.41
AUs 1 2      0.23     0.71     0.53     0.53     0.18     0.50     0.54     0.50     0.22     0.52     0.50     0.50     0.50     0.44
AUs 22 25    0.24     0.44     0.45     0.22     0.28     0.07     0.23     0.07     0.24     0.45     0.42     0.07     0.42     0.38
AUs 17 24    0.27     0.55     0.58     0.07     0.31     0.23     0.09     0.23     0.27     0.58     0.54     0.23     0.55     0.49
AUs 16 25    0.37     0.40     0.30     0.13     0.39     0.23     0.44     0.23     0.37     0.36     0.35     0.23     0.35     0.35
AUs 25 27    0.27     0.40     0.40     0.27     0.30     0.08     0.28     0.08     0.27     0.40     0.37     0.08     0.38     0.34
AUs 17 26    0.31     0.50     0.55     0.17     0.35     0.13     0.19     0.12     0.31     0.55     0.52     0.12     0.52     0.48
AUs 30       0.32     0.44     0.47     0.28     0.36     0.11     0.30     0.11     0.32     0.47     0.45     0.11     0.45     0.43
AUs 5        0.30     0.35     0.39     0.25     0.34     0.04     0.25     0.04     0.30     0.38     0.36     0.04     0.36     0.33

Tab. 5.4: Normalized distances among the training and test points in 3D latent
space

Mean                 0.3050
Median               0.3441
Standard deviation   0.1514
Quantile 0.1         0.1181
Quantile 0.9         0.5126

Tab. 5.5: Results of the classification test in 3D latent space

At first glance, the results in Tab. 5.4 are better than those obtained from the 2D
latent space; however, the data are less informative than those in the previous tests.
It is possible to transform the distances of each row into probabilities of belonging to
a particular facial configuration using the following equation:

\begin{align}
p(i,j) &= 1 - d(i,j) \notag \\
p(i,j) &= \frac{1}{\sum_{k=1}^{n} p(i,k)} \, p(i,j) \tag{5.1}
\end{align}

where d(i,j) is the distance in row i and column j, and the second line normalizes each
row so that it sums to one. The result is a confusion matrix,
which can be used to compute the conditional entropy H(X|C) as:

\begin{align}
X_c &= \{ p(c,j_1), \ldots, p(c,j_n) \} \notag \\
H(X_c \mid C) &= \sum_{k=1}^{n} X_{ck} \log_2 \frac{1}{X_{ck}} \tag{5.2}
\end{align}

Using Eq. 5.2 on Tab. 5.2 and Tab. 5.4 we obtain entropies of approximately 0.2 and 3.7
respectively, which confirms that the results of the previous test are much more
informative than those of the current test.
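Eq. 5.1 and Eq. 5.2 translate directly into a few lines of code. The sketch below computes the row-normalized probabilities and the entropy of every row of a distance table; how the per-row entropies are aggregated into the single values quoted above is not specified in the text, so the sketch stops at the per-row level. The function names are hypothetical.

```python
import numpy as np


def row_probabilities(distances):
    """Eq. 5.1: p(i,j) = 1 - d(i,j), followed by normalization of each row."""
    p = 1.0 - np.asarray(distances, dtype=float)
    return p / p.sum(axis=1, keepdims=True)


def row_entropy(probabilities):
    """Eq. 5.2 for every row c: sum over k of X_ck * log2(1 / X_ck), in bits."""
    p = np.asarray(probabilities, dtype=float)
    # Zero entries contribute nothing; replace them by 1 so that log2 gives 0.
    safe = np.where(p > 0, p, 1.0)
    return -(p * np.log2(safe)).sum(axis=1)
```

A row whose probability mass is concentrated on a single configuration has an entropy close to 0, whereas a row with n nearly equal entries has an entropy close to log2(n).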

The points in this new latent space are more concentrated in specific areas; consequently
the points belonging to a specific area are closer to each
other, which causes the generation of a facial expression that is not sufficiently
similar to the AU configuration under examination. This in turn is due to the fact that
the generated observation is a mean; therefore, if the points of a specific class are not
well separated from the other classes, the generated observation will not be consistent
with what is expected.

In Tab. 5.5 we report the mean, median, standard deviation, quantile 0.1 and 
quantile 0.9 of the previous results. 

What we learnt is that the identity problem is not the only issue in collecting
emotional data for training a GP-LVM. It is also necessary that the faces
exhibit prominent facial expressions with some informative emotional power, in order
to generate a space where the classes are well separated. Without this characteristic
the classification process remains challenging with GP-LVM, at least when using
all the pixels of the face as features.

Probably, the intrinsic dimension of the latent space is higher than those tested
here. This is because changes of pose, lighting (when the normalization of the light
conditions fails) and other distortions in the training data increase the
intrinsic dimension of the latent space, which consequently is not limited to changes
of the facial configuration. Since the noise on the observations is sometimes stronger
than the changes of the facial configuration, the model probably treats the AU
changes as less important dimensions, discarding them during the process of
dimensionality reduction. Conversely, the noise due to the face pose and other similar
distortions is wrongly considered a more important feature, so that it affects the
resulting latent space.

By contrast, the results obtained with the web dataset are encouraging. As they
are real facial expressions captured during natural social interactions, they represent
the observations that we expect to deal with in future work.
Furthermore, we expect that using faces with a more visible emotional power
(unlike the majority of facial expressions in the MMI dataset) will generate a latent
space with visible axes of core affect variables, like those theorized by Russell.

Fig. 5.19: Classification results of Subject 1 facial expressions

Fig. 5.20: Classification results of faces from the web dataset



Chapter 6 

Conclusions and future works 



In this work we have seen how emotions can be used in several useful applications
in disparate domains. The majority of these applications require, as a prerequisite, the
non-trivial task of automatic emotion recognition, as we showed in this work.

Most researchers have worked on the recognition of the six basic emotions,
obtaining good results at least in a laboratory setting. Obviously, the recognition of
only these few classes provides poor information for most applications, especially
those concerning social interaction between a user and a computer.

This issue led us to move in the direction currently taken by many
affective computing researchers, namely the recognition of the continuous core affect
space of emotions as supported by Russell's theory.

To accomplish this task we presented an architecture for the extraction of faces
from videos, followed by a model for the regression of a latent space. For several
reasons we did not make use of labels and supervised techniques, but interpreted
the regression of the core affect as an unsupervised dimensionality reduction procedure.

Since dimensionality reduction techniques based on linear mappings were not
powerful enough for our purpose, a probabilistic model based on Gaussian Processes, the
GP-LVM, was proposed.

Preliminary tests with this model allowed us to investigate the advantages and
shortcomings of this tool, obtaining useful guidelines for future improvements.

First of all, GP-LVM suffers from the differences of identity and
pose contained in the dataset used for the training process. Therefore, the dataset
used for training is crucial in determining the final classification performance
of the model. For this reason it is appropriate to use a training set comprising
only a single identity exhibiting a wide range of facial expressions, without noise due
to different poses.

The process used for the face normalization is quite stable, but not stable enough
to produce a good-quality training set. Therefore in future work it will be more
appropriate to use frames labelled with at least the position of the eyes and the
nose, in order to generate a training set not affected by noise. Nevertheless, the face
normalization procedure presented in this work could be useful in the future to
capture new observations in the final real-life application.

In our opinion, an ideal dataset has to present several subjects involved in a
variety of believable social interactions in a controlled environment, namely with uniform
light conditions, no occlusions and faces with little or no rotation.
The best way to collect this kind of data is probably to employ good actors and actresses
and give them a list of plots to act, each one involving different affective states. Then,
as stated previously, each video frame has to be labelled with the position of the
facial landmarks.

Fig. 6.1: Graphical model of a HGP-LVM for facial expression recognition

For our purpose the labelling of this dataset is not necessary, and is probably
an unnatural process. In fact, currently the only way to objectively describe a
facial expression is by using AUs; however, we saw that producing such a dataset leads
to a challenging classification process, due to the small changes among different
facial expressions. Unfortunately, the use of an unlabelled dataset implies difficulties
during the evaluation process, which can then be carried out only through qualitative
evaluations.

In this work we did not consider any kind of temporal dynamics in the data;
however, the work of Wang et al. [83] illustrates how to add a temporal constraint to
a Gaussian Process, thus allowing the latent space to include this temporal
information. Making use of temporal dynamics means that temporal sequences of
facial expressions will generate smoother paths over the latent space, which is a
crucial feature for a subsequent step: the generated paths could be used to classify
the affective state of the subject through the characteristics of the path itself.

Unlike other facial expression databases, which consider only footage
without a history of the affective state of the user, the ideal dataset described above
is able to provide this crucial temporal information. This cue could be used, for example,
to predict the likely affective state given a set of previous affective states, and
to simulate human behaviour in a probabilistic framework.

Another important work that could enhance the results of our model is the Hierarchical
GP-LVM by Lawrence and Moore [84]. With this model it is possible to extend
GP-LVM through hierarchies, allowing the expression of conditional independencies
in the data as well as in the manifold structure.

In our case it will be possible to consider the top and the bottom part
of the face separately (independence), obtaining a latent space in which more
combinations of facial expressions can be generalized, with better chances of recognition
(Fig. 6.1).

Clearly this work is limited to investigating affect recognition using only the
facial expressions of a subject; however, there are also other important modalities
for affect recognition, such as gestures, speech and blood pressure. All these
modalities can be studied and used together in order to enhance the performance of
the system.

The overall current classification performance of the model presented in this work
is not entirely satisfactory; however, the cause of this performance is clear to us,
namely a dataset not suitable for this regression process. In fact, GP-LVM
is a good tool for dimensionality reduction, but it suffers from noise
due to different identities and poses, both of which are present in the datasets used
for the preliminary tests. Furthermore, if the movements of the facial muscles are not
strong enough, the classification of these facial configurations remains challenging,
at least when using all the pixels of the face as features. For this reason we are
motivated to extend our tests to a different dataset like that described above, hoping
for more satisfactory results.